Python framework spec-first and local-first for creating, experimenting with, and running LLM agents on-premise.
AgentForge is the production destination of a 4-stage model certification funnel.
19 models entered the funnel. Each stage was a gate — not a comparison, but an elimination filter:
| Stage | Gate question | Filter |
|---|---|---|
| ABS | Can it call tools at all? | 19 entered |
| LOP | Does it hold under real operational pressure? | — |
| FORGE | Can it function as an agent? Multi-turn, chained, autonomous? | 7 entered |
| REAL | Does it work in production? Real browser, real tests, no shortcuts? | 4 proven |
| agent-FORGE | Deploy | Runtime for models that earned it |
The models running in agent-FORGE are not "the best on a leaderboard" — they are the ones that proved they can do the actual job, end to end, on real hardware, with real tasks.
This shaped the framework's central thesis, validated empirically across 4 months of benchmarks:
20% is the model, 80% is the runtime.
The quality of a local agent depends less on the chosen model and more on how the runtime manages context, tool use decisions, guardrails, and evaluation. The framework implements these mechanisms in an explicit, testable, and spec-driven way.
- Key Features
- Requirements & Installation
- Basic Flow
- Execution Channels
- Tool Calling
- Tool Registry — Agents That Create Tools
- Active Guardrails
- Autonomous Reflection
- Evaluation & Scoring
- Project Structure
- CLI Reference
- Development & Testing
| Feature | What it means in practice |
|---|---|
| Spec-First | An agent is born from an agent.yaml. No behavior exists outside the spec. |
| Local-First | Optimized for Ollama. No external API dependencies in the critical path. |
| Tool Calling Model-Driven | The model decides when to call tools via the native OpenAI/Ollama protocol. |
| Loop Guard | Detects repeated tool calling cycles and stops before exhausting the budget. |
| Autonomous Reflection | N rounds of self-critique after the initial output — improves quality without human intervention. |
| Active Guardrails | Checks must_not on the output using the model itself as judge. Automatically retries if violated. |
| Eval with LLM Judge | Multidimensional scoring using local Ollama or Gemini as evaluator. |
| 4 Channels | CLI, HTTP (n8n), MCP (Claude Code/Desktop), Telegram — same spec, any interface. |
| Multi-Agent | Orchestrator delegates subtasks to workers declared in the YAML via run_agent. |
| Tool Registry | Agents create, test, and register new Python tools at runtime. Tools become permanently available to all agents. |
| 296 tests | Broad coverage ensuring runtime stability and API contracts. |
Requirements:
- Python 3.11+
- Ollama installed and running (
docker composeor local) - Recommended model:
qwen3.5:9b(simple tasks) orqwen3.5:27b(complex tasks)
AgentForge works with any Ollama model, but was empirically optimized for the qwen3.5 family based on 4 months of benchmarks (ABS → LOP → FORGE → REAL) covering 19 local models.
| Model | VRAM | Speed* | Recommended use |
|---|---|---|---|
qwen3.5:9b |
~7 GB | ~45 tok/s | Monitoring, orchestration, simple queries |
qwen3.5:27b |
~17 GB | ~25 tok/s | Coding with tests, multi-step analysis, documentation generation |
*Measured on test hardware (Xeon E5-2696v3 + dual RTX 3060 12GB).
Results on evaluation scenarios with qwen3.5:27b: FORGE F3 94.4%, REAL P4 91.7%.
Methodological details and selection criteria:
docs/MODEL-STRATEGY.md
git clone <repo-url>
cd agents-framework
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"Every agent follows three steps:
agentforge wizardGenerates agents/<id>/agent.yaml with persona, tools, memory, and guardrails.
agentforge generate --path agents/<id>/agent.yamlProduces in agents/<id>/:
system_prompt.md— structured system promptruntime.yaml— execution parameterstools.yaml— tool schemaeval.yaml— evaluation configurationREADME.md— agent technical documentation
agentforge run --agent-dir agents/lab-ops --input "How is the server?"Output in JSON (--mode raw) or human-readable text (--mode pretty).
The same agent runs on four channels without changing the spec:
agentforge run --agent-dir agents/lab-ops --input "text" --mode prettyagentforge serve --agent-dir agents/lab-ops --port 8080POST http://localhost:8080/run
Content-Type: application/json
{"input": "server status"}
GET http://localhost:8080/healthagentforge mcp --transport stdioConfigure in .mcp.json at the project root:
{
"mcpServers": {
"agentforge": {
"command": "/path/to/.venv/bin/agentforge",
"args": ["mcp"]
}
}
}Exposed tools: collect_system_health, read_log_tail, scan_directory, run_agent.
export TELEGRAM_BOT_TOKEN="123456:ABC..."
agentforge telegram --agent-dir agents/lab-opsThe bot receives text messages, runs the agent, and replies. Shows "typing..." during processing and signals in the reply if guardrails were triggered.
The respond_or_tool mode activates the model-driven tool calling pipeline:
# agent.yaml
workflow:
mode: respond_or_tool
max_tool_cycles: 5Flow per cycle:
- Inference with tool schema (native OpenAI/Ollama protocol)
- Model decides: direct response or tool call
- If tool call → execute → inject result → next cycle
- Loop guard: stops if the same
(tool, args)repeats in the same cycle - Upon exhausting
max_tool_cycles→ final inference with all accumulated results
Useful fields in ToolSpec:
tools:
- name: collect_system_health
description: "Collects CPU, RAM, disk, and GPU metrics"
when_to_use: "Whenever the user asks about server status"
when_not_to_use: "Do not use for questions about specific logs"when_to_use and when_not_to_use are injected into the tool description — they act as decision hints that reduce incorrect calls in local models.
AgentForge implements the Voyager pattern: agents create new Python tools during execution, test them with real pytest, and register them permanently. Every registered tool becomes available to any agent in the next session — no restart needed.
Full flow:
tool-builder receives description
→ write_file: implementation.py
→ read_file: re-reads before writing tests (consistency)
→ write_file: test_implementation.py
→ run_bash: pytest (real tests, no mocks)
→ register_tool_file: copies to tool_registry/ + updates registry.yaml
→ Tool available immediately and in every future session
Using the tool-builder:
agentforge run --agent-dir agents/tool-builder --input "
Create a Python tool called rate_limiter that controls requests per sliding window.
After pytest passes, register it in the framework.
"Using a registered tool in another agent:
# agent.yaml
tools:
- name: search_memory
description: Searches agent-mesh shared_memory by LIKE query on key and value.
when_to_use: "Use to retrieve context from previous sessions."
input_schema: '{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}'Tools available in tool_registry/:
| Tool | Description |
|---|---|
search_memory |
Searches ~/.agent-mesh/state.db by LIKE on key and value. Returns top-3. |
Full documentation:
docs/TOOL-REGISTRY.md
Guardrails are checked after the output is generated, before returning to the user.
# agent.yaml
guardrails:
must:
- always use real data from tools
- provide evidence-based recommendations
must_not:
- fabricate metrics
- access files outside the logs directory
- modify any system fileRuntime behavior:
- Output generated (tool calling or direct response)
- Autonomous reflection applied (if configured)
- The model analyzes its own output against
must_notrules - If violation detected: correction prompt + re-inference (up to 2 retries)
- Persistent violations recorded in
result["metadata"]["guardrail_violations"]
An orchestrator is a regular agent that declares workers in workflow.agents. The engine automatically injects run_agent into the tool schema — the model decides when and to whom to delegate.
# agents/orchestrator/agent.yaml
workflow:
mode: respond_or_tool
max_tool_cycles: 8
agents:
- name: lab-ops
agent_dir: agents/lab-ops
description: Server health monitoring and log inspection
- name: another-agent
agent_dir: agents/another-agent
description: Document analysisDelegation flow:
- Orchestrator receives task from user
- Model analyzes and decides to delegate → calls
run_agent(agent_dir=..., input=...) - Engine loads the worker, runs
runtime.run(input), returns the output - Worker output is injected into the orchestrator's history
- Orchestrator synthesizes all results into a final response
Workers only appear in the schema if declared — the model cannot invent agents outside the spec.
agentforge run --agent-dir agents/orchestrator --input "Full server report"The agent can review its own output before returning:
# agent.yaml
workflow:
mode: respond_or_tool
reflection_rounds: 2Each round applies a structured self-critique prompt:
- Is the response complete and accurate?
- Does it respect the role constraints?
- Can it be more objective?
Reflection is stateless (no history) and runs N times sequentially.
agentforge eval \
--agent-dir agents/lab-ops \
--dataset agents/lab-ops/eval_dataset.yamlDataset format:
cases:
- input: "How is the server health?"
notes: "should collect real data before responding"
- input: "Are there errors in the latest system logs?"
notes: "should use read_log_tail"Results saved to agents/<id>/eval_runs/<timestamp>.jsonl.
Enable automatic scoring in agent.yaml:
eval:
judge_model: "gemma4:e4b"
criteria:
- response based on real data
- objective and actionable recommendation
- no fabricated metricsThe judge scores each criterion from 0–100 and calculates an average score. Supports local Ollama models or gemini-* via Gemini API.
agents-framework/
├── agents/ # Agents (each self-contained)
│ ├── lab-ops/ # Infra monitoring (reference agent)
│ ├── tool-builder/ # Creates and registers Python tools
│ ├── forge-f3/ # FX + crypto analysis (FORGE F3 benchmark)
│ ├── real-p3/ # Python tool with real tests (REAL P3 benchmark)
│ ├── real-p4/ # Skill generator (REAL P4 benchmark)
│ └── orchestrator/ # Multi-agent orchestrator
│
├── tool_registry/ # Tools generated by agents
│ ├── registry.yaml # Persistent manifest (managed automatically)
│ └── search_memory.py # Searches agent-mesh shared_memory
│
├── src/agentforge/
│ ├── channels/ # Execution channels
│ │ ├── http.py # FastAPI REST (n8n, automations)
│ │ ├── mcp_server.py # FastMCP (Claude Code/Desktop)
│ │ └── telegram.py # Telegram bot (async polling)
│ ├── providers/
│ │ ├── ollama.py # Ollama integration (chat + generate + think:false)
│ │ └── mock.py # Deterministic provider for tests
│ ├── runtime/
│ │ ├── engine.py # AgentRuntime: full pipeline
│ │ └── memory.py # History, window, persistence
│ ├── tools/
│ │ ├── registry.py # _ToolRegistry: builtins + dynamic
│ │ ├── dynamic_loader.py # Loads tool_registry/ on init
│ │ ├── register_tool_file.py # Validates, copies, and registers Python files
│ │ ├── write_file.py # File read/write (AGENT_WORKDIR)
│ │ ├── run_bash.py # Bash with destructive command blocklist
│ │ ├── http_get.py # HTTP GET
│ │ └── send_claudio.py # Telegram notification (Claudio bot)
│ └── generators/
│ └── agent_files.py # Artifact generation from spec
│
├── scripts/
│ └── run_benchmark_eval.py # FORGE + REAL scenario runner
│
├── tests/ # 278 tests (MockProvider, no Ollama required)
├── docs/
│ ├── ARCHITECTURE.md # Full technical reference
│ ├── MODEL-STRATEGY.md # qwen3.5 model selection (empirical)
│ ├── FINETUNING-STRATEGY.md # LoRA fine-tuning strategy
│ └── TOOL-REGISTRY.md # Tool Registry: agents that create tools
├── .mcp.json # MCP config for Claude Code
└── pyproject.toml
| Command | Description |
|---|---|
agentforge wizard |
Creates spec interactively |
agentforge generate --path <yaml> |
Generates artifacts from spec |
agentforge validate [--root .] |
Validates framework specs |
agentforge validate-agent --path <yaml> |
Validates an agent.yaml |
agentforge run --agent-dir <dir> --input <text> |
Runs the agent |
agentforge eval --agent-dir <dir> --dataset <yaml> |
Evaluates with dataset |
agentforge serve --agent-dir <dir> [--port 8080] |
Starts HTTP API |
agentforge mcp [--transport stdio|http] |
Starts MCP server |
agentforge telegram --agent-dir <dir> [--token <tok>] |
Starts Telegram bot |
# All tests
pytest -q
# Specific file
pytest tests/test_runtime_engine.py -v
# HTTP channel
pytest tests/test_http_channel.py -v
# Telegram channel
pytest tests/test_telegram_channel.py -vThe project uses MockProvider (deployment.provider: mock) for deterministic tests — no Ollama required in CI.
For detailed technical documentation, see docs/ARCHITECTURE.md.