Stage 3 of 4 in a progressive benchmark methodology for evaluating local LLMs on agentic tasks.
ABS → LOP → FORGE → REAL
FORGE evaluates LLM agents on chained real-world tasks — not isolated questions, but full agentic loops with tool use, artifact creation, and measurable delivery.
Each scenario requires the model to:
- Receive a goal
- Plan and call tools autonomously
- Produce concrete artifacts (files, HTTP responses, reports, code fixes)
- Be scored on the quality of what it delivered
This moves beyond "can it answer a question?" toward "can it do the job?"
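The goal, tool calls, artifacts sequence above can be sketched as a small loop. This is an illustration only, not the FORGE harness: `run_scenario`, `RunResult`, `scripted_policy`, and the `write_file` tool are hypothetical names, and the scripted policy stands in for a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    artifacts: dict = field(default_factory=dict)  # artifact name -> content
    turns: int = 0                                  # number of tool calls made


def run_scenario(goal: str,
                 policy: Callable[[list], Optional[tuple]],
                 tools: dict,
                 max_turns: int = 10) -> RunResult:
    """Drive a goal -> tool calls -> artifacts loop until the policy stops."""
    result = RunResult()
    history = [("goal", goal)]
    for _ in range(max_turns):
        action = policy(history)      # hypothetical: (tool_name, arg) or None
        if action is None:            # the policy decides it is done
            break
        name, arg = action
        output = tools[name](arg, result.artifacts)
        history.append((name, output))
        result.turns += 1
    return result


def write_file(arg, artifacts):
    """Hypothetical tool: store an in-memory 'file' as an artifact."""
    filename, content = arg
    artifacts[filename] = content
    return f"wrote {filename}"


def scripted_policy(history):
    """Stand-in for an LLM: write one report, then stop."""
    if len(history) == 1:
        return ("write_file", ("report.md", "# Findings"))
    return None


result = run_scenario("Summarise the site", scripted_policy,
                      {"write_file": write_file})
print(result.turns, sorted(result.artifacts))
```

Scoring then looks at `result.artifacts`, the delivered output, rather than at any single model reply.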
FORGE is not a ranking benchmark. It is gate 3 of 4 in a progressive model certification funnel.
19 models entered ABS. Each stage eliminates models that can't meet the next bar. By the time a model reaches FORGE, it has already proven it can call tools correctly (ABS) and hold up under real operational pressure (LOP).
| Stage | Gate question | What it proves | Filter |
|---|---|---|---|
| ABS | Can it call tools at all? | Tool mechanics, parameter accuracy, structured output | 19 entered |
| LOP | Does it hold under real pressure? | Consistency under operational load, no external APIs | — |
| FORGE ← you are here | Can it function as an agent? | Multi-turn chaining, autonomous planning, deliverable output | 7 entered |
| REAL | Does it work in production? | Real browser, tests that pass, enterprise-grade tasks | 4 proven |
| agent-FORGE | Deploy | Production runtime for models that survived the full funnel | — |
FORGE is where the question changes from "can it do tool calls?" to "can it be an agent?" — multi-turn, chained tools, autonomous planning, measurable delivery. 7 models entered; only those that proved agentic capability advanced to REAL.
| ID | Name | Difficulty | What it measures |
|---|---|---|---|
| F1 | Real Estate Web App | High | Full-stack frontend — HTML/CSS/JS, fetch API, JSON filtering, responsive design |
| F2 | Web Analysis + Report + Telegram | Medium | HTTP scraping, structured analysis, report writing, Telegram notification |
| F3 | Market Intelligence — FX/Crypto | Medium | Multi-API orchestration, financial analysis, formatted delivery |
| F4 | Code Review + Bug Fix | High | Code reading, bug identification, automated test validation |
| F5 | Code Review at 60 KB context | Very High | Long-context coherence, multi-deliverable output without collapse |
Each scenario is scored across four dimensions:
| Dimension | Weight | Evaluator | What it measures |
|---|---|---|---|
| AUTO | 30% | forge_runner.py | Objective criteria: file exists, server responds, function called, test passes |
| LLM-JUDGE | 30% | gemma4:26b via forge_judge.py | Output quality against a scenario-specific rubric |
| CLAUDE | 20% | Claude Code | Technical completeness, correctness, edge cases |
| HUMAN | 20% | Author | Aesthetics, usability, "would I use this in production?" |
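The AUTO criteria in the table can be pictured as boolean checks whose pass rate feeds the score. The sketch below is an assumption about the general shape, not the contents of forge_runner.py: the helper names, the tool names in `tool_log`, and the pass-rate aggregation are all hypothetical, and how the rate maps onto the 0–4 scale is not specified here. "Server responds" and "test passes" checks would follow the same callable pattern.

```python
import tempfile
from pathlib import Path
from typing import Callable


def file_exists(path: Path) -> bool:
    """'File exists' criterion: the expected artifact is on disk."""
    return path.is_file()


def function_called(tool_log: list, name: str) -> bool:
    """'Function called' criterion: the tool appears in the call log."""
    return name in tool_log


def auto_pass_rate(checks: dict) -> tuple:
    """Run each named check and return (fraction passed, per-check results)."""
    if not checks:
        raise ValueError("at least one check is required")
    results = {label: bool(check()) for label, check in checks.items()}
    return sum(results.values()) / len(results), results


with tempfile.TemporaryDirectory() as workdir:
    report = Path(workdir) / "report.md"
    report.write_text("# Findings\n")
    tool_log = ["http_get", "write_file"]  # hypothetical tool names

    rate, details = auto_pass_rate({
        "report written": lambda: file_exists(report),
        "http_get called": lambda: function_called(tool_log, "http_get"),
        "telegram sent": lambda: function_called(tool_log, "send_telegram"),
    })

print(f"{rate:.2f}", details)
```

Here two of the three checks pass, because the hypothetical `send_telegram` tool never appears in the log.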
Composite score:
composite = (auto_norm×0.30 + llm_judge×0.30 + claude×0.20 + human×0.20)
Scale: 0–4 (consistent with ABS and LOP for longitudinal comparison).
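As a worked example of the composite formula, here is a minimal sketch. It assumes all four inputs are already expressed on the 0–4 scale (the name `auto_norm` suggests the AUTO result is normalized first, but that step is not shown here); the `composite` function, the `WEIGHTS` table, and the example scores are illustrative, not taken from the repository.

```python
# Weights as listed in the scoring table.
WEIGHTS = {
    "auto_norm": 0.30,
    "llm_judge": 0.30,
    "claude": 0.20,
    "human": 0.20,
}


def composite(scores: dict) -> float:
    """Weighted sum of the four dimension scores, each assumed to be in [0, 4]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 4.0:
            raise ValueError(f"{name} must be within 0-4, got {scores[name]}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


# Made-up scores for illustration only.
example = {"auto_norm": 4.0, "llm_judge": 3.0, "claude": 3.5, "human": 3.0}
print(round(composite(example), 2))  # 1.2 + 0.9 + 0.7 + 0.6 = 3.4
```

Because the weights sum to 1.0, the composite stays on the same 0–4 scale as its inputs.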
Benchmarked on fox-server: Xeon E5-2696v3 (18c/36t) · 128 GB ECC RAM · 2× RTX 3060 12 GB (24 GB VRAM total). No cloud. No external GPU.
| # | Model | Harness | F1 | F2 | F3 | F4 | F5 | Total |
|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-sonnet-4-6 | Claude API | 81% | 100% | 100% | 100% | 83% | 91% |
| 🥈 | gemma4:26b | Aurelia/Telegram | 92% | 100% | 100% | 100% | 67% | 89%† |
| 🥉 | qwen3.5:9b | FORGE direct | 77% | 100% | 100% | 89% | 83% | 87% |
| 4 | qwen3.6:27b | FORGE direct | — | — | — | — | 78% | 78%† |
| 5 | gemma4:26b | FORGE direct | 100% | — | — | — | 56% | 71%† |
†Partial coverage — evaluated only on completed scenarios.
Key findings:
- `claude-sonnet-4-6` achieved 91% with a validated equivalent harness. F4 (Code Review) completed in 5 turns with parallel file reads, a standout result.
- `gemma4:26b` reached 89% via the Aurelia/Telegram harness. Best absolute scores on tested scenarios.
- `qwen3.5:9b` (fully local, 9B parameters) reached 87%, comparable to 26B models on this task set.
- F5 (60 KB context) was the discriminating scenario: models that degraded here showed long-context collapse under realistic pressure.
forge/
├── scenarios/ # Scenario definitions (F1.json – F5.json)
│ ├── fixtures/ # Static test fixtures
│ └── prds/ # Product requirements docs used in F1/F4/F5
├── docs/
│ ├── SCORING.md # Scoring rubric by dimension and scenario
│ ├── FORGE-CRITIQUE-v0.1.md
│ └── RELATORIO-FORGE-v1.md # Full results report
├── results/ # Run outputs by scenario (F1/ – F5/)
├── scripts/ # Utilities
├── forge_pipeline.py # Core evaluation pipeline
├── run_f*.sh # Batch runners per scenario
└── logs/ # Execution logs
FORGE's findings fed directly into REAL (Stage 4) — a stricter evaluation using browser automation, real test suites, and SPA interaction, eliminating another round of models.
After REAL, the accumulated learnings from all 4 stages shaped the design of agent-FORGE — a production multi-agent framework. Its core architectural choices trace directly to benchmark observations:
- The runtime matters as much as the model. Prompt construction, tool schemas, loop control, and reflection rounds explained more score variance than model size. → agent-FORGE implements spec-first YAML agents with active guardrails and autonomous reflection.
- Local models are viable for production. `qwen3.5:9b` reached 87% on tasks requiring web scraping, code review, multi-API orchestration, and report writing, entirely on consumer hardware. → agent-FORGE defaults to local-first Ollama with no cloud dependency.
- Memory is the missing layer. Agents that couldn't refer back to prior context degraded across multi-step tasks. → agent-FORGE ships a 3-tier memory system (SQLite · Qdrant/mem0 · Kuzu graph).
- agent-FORGE — the framework that emerged from this research
- KDMILE 2026 — paper submitted to the Brazilian Symposium on Knowledge Discovery and Intelligent Systems
- Conrado Nogueira — full profile and project index
Benchmarked 2026-06-05. Hardware: fox-server (second-hand Xeon + dual RTX 3060). All inference local, no cloud.