FORGE — Framework for Open Real-world Generic Evaluation

Stage 3 of 4 in a progressive benchmark methodology for evaluating local LLMs on agentic tasks.

ABS → LOP → FORGE → REAL

What is FORGE?

FORGE evaluates LLM agents on chained real-world tasks — not isolated questions, but full agentic loops with tool use, artifact creation, and measurable delivery.

Each scenario requires the model to:

1. Receive a goal
2. Plan and call tools autonomously
3. Produce concrete artifacts (files, HTTP responses, reports, code fixes)
4. Be scored on the quality of what it delivered

This moves beyond "can it answer a question?" toward "can it do the job?"
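The four steps above can be sketched as a small driver loop. This is a minimal illustration, not FORGE's actual harness: `ToolCall`, `RunResult`, `run_agent`, the `write_file` tool, and the scripted stand-in for the model are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


# Hypothetical types for illustration; the real FORGE harness may differ.
@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class RunResult:
    artifacts: dict = field(default_factory=dict)  # e.g. {"report.md": "..."}
    turns: int = 0
    finished: bool = False


def run_agent(goal, model_step, tools, max_turns=10):
    """Drive a goal -> tool call -> artifact loop until the model stops."""
    history = [("goal", goal)]
    result = RunResult()
    for turn in range(1, max_turns + 1):
        result.turns = turn
        action = model_step(history)  # the model decides the next step
        if action is None:            # the model signals it is done
            result.finished = True
            break
        output = tools[action.name](result.artifacts, **action.args)
        history.append((action.name, output))
    return result


def write_file(artifacts, path, content):
    """Toy tool: records an artifact in memory instead of touching disk."""
    artifacts[path] = content
    return f"wrote {path}"


def scripted_model(history):
    """Stand-in for an LLM: writes one file, then stops."""
    if len(history) == 1:
        return ToolCall("write_file", {"path": "report.md", "content": "# Report"})
    return None


result = run_agent("Write a report", scripted_model, {"write_file": write_file})
print(result.turns, result.finished, sorted(result.artifacts))
```

The `max_turns` cap is a design choice in this sketch: a run that never signals completion ends with `finished=False`, which a scorer can treat as a failed delivery rather than waiting indefinitely.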

The Certification Funnel

FORGE is not a ranking benchmark. It is gate 3 of 4 in a progressive model certification funnel.

19 models entered ABS. Each stage eliminates models that can't meet the next bar. By the time a model reaches FORGE, it has already proven it can call tools correctly (ABS) and hold up under real operational pressure (LOP).

| Stage | Gate question | What it proves | Filter |
| --- | --- | --- | --- |
| ABS | Can it call tools at all? | Tool mechanics, parameter accuracy, structured output | 19 entered |
| LOP | Does it hold under real pressure? | Consistency under operational load, no external APIs | - |
| FORGE ← you are here | Can it function as an agent? | Multi-turn chaining, autonomous planning, deliverable output | 7 entered |
| REAL | Does it work in production? | Real browser, tests that pass, enterprise-grade tasks | 4 proven |
| agent-FORGE | Deploy | Production runtime for models that survived the full funnel | - |

FORGE is where the question changes from "can it do tool calls?" to "can it be an agent?" — multi-turn, chained tools, autonomous planning, measurable delivery. 7 models entered; only those that proved agentic capability advanced to REAL.
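The gating logic of the funnel can be sketched as a chain of filters, where only survivors of one stage are evaluated at the next. The pass mark, model names, and scores below are invented for illustration; the source does not state the actual thresholds used at each gate.

```python
def run_funnel(models, gates):
    """Apply gates in order; a model must pass each gate to reach the next."""
    survivors = list(models)
    entered = {}
    for stage, passes in gates:
        entered[stage] = len(survivors)  # how many models reached this stage
        survivors = [m for m in survivors if passes(m)]
    return survivors, entered


# Toy scores on the 0-4 scale; invented for illustration only.
scores = {
    "model-a": {"ABS": 3.6, "LOP": 3.2, "FORGE": 3.5},
    "model-b": {"ABS": 3.1, "LOP": 1.9, "FORGE": 0.0},
    "model-c": {"ABS": 1.2, "LOP": 0.0, "FORGE": 0.0},
}
PASS_MARK = 2.5  # hypothetical threshold, not taken from the benchmark

gates = [
    (stage, lambda m, s=stage: scores[m][s] >= PASS_MARK)
    for stage in ("ABS", "LOP", "FORGE")
]
survivors, entered = run_funnel(scores, gates)
print(survivors, entered)
```

Because each gate only sees the previous gate's survivors, the per-stage "entered" counts shrink monotonically, which matches the shape of the Filter column above.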

Scenarios (F1–F5)

| ID | Name | Difficulty | What it measures |
| --- | --- | --- | --- |
| F1 | Real Estate Web App | High | Full-stack frontend: HTML/CSS/JS, fetch API, JSON filtering, responsive design |
| F2 | Web Analysis + Report + Telegram | Medium | HTTP scraping, structured analysis, report writing, Telegram notification |
| F3 | Market Intelligence — FX/Crypto | Medium | Multi-API orchestration, financial analysis, formatted delivery |
| F4 | Code Review + Bug Fix | High | Code reading, bug identification, automated test validation |
| F5 | Code Review at 60 KB context | Very High | Long-context coherence, multi-deliverable output without collapse |

Scoring System

Each scenario is scored across four dimensions:

| Dimension | Weight | Evaluator | What it measures |
| --- | --- | --- | --- |
| AUTO | 30% | `forge_runner.py` | Objective criteria: file exists, server responds, function called, test passes |
| LLM-JUDGE | 30% | `gemma4:26b` via `forge_judge.py` | Output quality against a scenario-specific rubric |
| CLAUDE | 20% | Claude Code | Technical completeness, correctness, edge cases |
| HUMAN | 20% | Author | Aesthetics, usability, "would I use this in production?" |
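The AUTO dimension is the only one that is purely mechanical, so it lends itself to a short sketch. The helpers below show how objective criteria such as "file exists", "server responds", and "function called" might be expressed as boolean checks and reduced to a pass rate. None of these names come from `forge_runner.py`; the tool-log format and file names are assumptions made for the example.

```python
import tempfile
import urllib.request
from pathlib import Path


def check_file_exists(workdir, name):
    """True if the agent produced the expected file."""
    return (Path(workdir) / name).is_file()


def check_server_responds(url, timeout=5.0):
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):  # connection errors, HTTP errors, bad URLs
        return False


def check_tool_called(tool_log, tool_name):
    """True if the (assumed) tool log records at least one call to tool_name."""
    return any(call["tool"] == tool_name for call in tool_log)


def auto_pass_rate(checks):
    """Fraction of objective checks that passed, from 0.0 to 1.0."""
    results = [bool(check()) for check in checks]
    return sum(results) / len(results) if results else 0.0


with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "index.html").write_text("<html></html>")
    tool_log = [{"tool": "write_file"}, {"tool": "http_get"}]
    rate = auto_pass_rate([
        lambda: check_file_exists(workdir, "index.html"),
        lambda: check_file_exists(workdir, "report.md"),
        lambda: check_tool_called(tool_log, "http_get"),
        lambda: check_tool_called(tool_log, "send_telegram"),
    ])
print(rate)
```

`check_server_responds` is defined but not called in the demo, so the example runs without network access.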

Composite score:

composite = (auto_norm×0.30 + llm_judge×0.30 + claude×0.20 + human×0.20)

Scale: 0–4 (consistent with ABS and LOP for longitudinal comparison).
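As a worked example of the formula, the sketch below computes a composite from four dimension scores on the 0-4 scale. The weights are taken from the table above. How `auto_norm` is derived is not specified in this document, so the linear mapping from a pass rate to 0-4 in `normalize_auto` is an assumption, and the input scores are made up.

```python
# Weights from the scoring table; they sum to 1.0.
WEIGHTS = {"auto": 0.30, "llm_judge": 0.30, "claude": 0.20, "human": 0.20}
SCALE_MAX = 4.0


def normalize_auto(pass_rate):
    """Map an AUTO pass rate in [0, 1] onto the 0-4 scale (assumed linear)."""
    return max(0.0, min(1.0, pass_rate)) * SCALE_MAX


def composite(auto_norm, llm_judge, claude, human):
    """Weighted sum of the four dimension scores, each on the 0-4 scale."""
    scores = {
        "auto": auto_norm,
        "llm_judge": llm_judge,
        "claude": claude,
        "human": human,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= SCALE_MAX:
            raise ValueError(f"{name} score {value} is outside 0-{SCALE_MAX:g}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up inputs: 75% of AUTO checks passed, plus three rubric scores.
score = composite(normalize_auto(0.75), llm_judge=3.5, claude=4.0, human=3.0)
print(round(score, 2))  # 3.35
```

Since the weights sum to 1.0, the composite stays on the same 0-4 scale as its inputs, which is what allows comparison with ABS and LOP scores.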

Results

Benchmarked on fox-server: Xeon E5-2696v3 (18c/36t) · 128 GB ECC RAM · 2× RTX 3060 12 GB (24 GB VRAM total). No cloud. No external GPU.

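| # | Model | Harness | F1 | F2 | F3 | F4 | F5 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 | claude-sonnet-4-6 | Claude API | 81% | 100% | 100% | 100% | 83% | 91% |
| 🥈 | gemma4:26b | Aurelia/Telegram | 92% | 100% | 100% | 100% | 67% | 89%† |
| 🥉 | qwen3.5:9b | FORGE direct | 77% | 100% | 100% | 89% | 83% | 87% |
| 4 | qwen3.6:27b | FORGE direct | - | - | - | - | 78% | 78%† |
| 5 | gemma4:26b | FORGE direct | 100% | - | - | - | 56% | 71%† |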

†Partial coverage — evaluated only on completed scenarios.

Key findings:

- claude-sonnet-4-6 achieved 91% with a validated equivalent harness. It completed F4 (Code Review) in 5 turns using parallel file reads.
- gemma4:26b reached 89% via the Aurelia/Telegram harness. Best absolute scores on tested scenarios.
- qwen3.5:9b (fully local, 9B parameters) reached 87%, within 2 points of gemma4:26b's 89% on this task set.
- F5 (60 KB context) was the discriminating scenario: models that degraded here showed long-context collapse under realistic pressure.

Repository Structure

forge/
├── scenarios/          # Scenario definitions (F1.json – F5.json)
│   ├── fixtures/       # Static test fixtures
│   └── prds/           # Product requirements docs used in F1/F4/F5
├── docs/
│   ├── SCORING.md      # Scoring rubric by dimension and scenario
│   ├── FORGE-CRITIQUE-v0.1.md
│   └── RELATORIO-FORGE-v1.md  # Full results report
├── results/            # Run outputs by scenario (F1/ – F5/)
├── scripts/            # Utilities
├── forge_pipeline.py   # Core evaluation pipeline
├── run_f*.sh           # Batch runners per scenario
└── logs/               # Execution logs

What Came After FORGE

FORGE's findings fed directly into REAL (Stage 4) — a stricter evaluation using browser automation, real test suites, and SPA interaction, eliminating another round of models.

After REAL, the accumulated learnings from all 4 stages shaped the design of agent-FORGE — a production multi-agent framework. Its core architectural choices trace directly to benchmark observations:

- The runtime matters as much as the model. Prompt construction, tool schemas, loop control, and reflection rounds explained more score variance than model size. → agent-FORGE implements spec-first YAML agents with active guardrails and autonomous reflection.
- Local models are viable for production. qwen3.5:9b scored 87% on tasks requiring web scraping, code review, multi-API orchestration, and report writing, entirely on consumer hardware. → agent-FORGE defaults to local-first Ollama with no cloud dependency.
- Memory is the missing layer. Agents that couldn't refer back to prior context degraded across multi-step tasks. → agent-FORGE ships a 3-tier memory system (SQLite · Qdrant/mem0 · Kuzu graph).

