magic-alt · May 22, 2026
@@ -11,6 +11,14 @@ __pycache__/
 data/
 tmpclaude-*
 
+# Benchmark data policy: commit manifests/schemas/thresholds, keep raw/private data local.
+benchmarks/raw/
+benchmarks/private/
+benchmarks/**/*.pdf
+benchmarks/**/*.xlsx
+benchmarks/**/*.xls
+benchmarks/**/labels.private.json
+
 # Example runtime artifacts
 examples/**/output/
 examples/**/fixtures/*.pdf

@@ -1,12 +1,14 @@
 # Jetbot
 
-Jetbot is a financial report analysis platform that turns PDF filings into structured financial statements, key notes, risk signals, event-study outputs, and trader-style summaries. It combines PDF extraction, validation, LLM orchestration, a FastAPI backend, and a Vue dashboard in one repository.
+Jetbot is a Filing-to-Model Copilot and Financial Fact Platform for evidence-backed financial report extraction. It turns PDF filings into canonical financial facts, structured statements, key notes, risk signals, event-study outputs, and analyst-ready summaries.
 
-It is designed for teams that need a single workflow to ingest reports, inspect extracted evidence, and ship the results through an API, a CLI, or a browser UI.
+It is designed for teams that need a single workflow to ingest reports, inspect source evidence, review and correct extracted facts, and ship the results through an API, a CLI, exports, or a browser UI.
 
 ## Highlights
 
-- End-to-end PDF pipeline for raw text, tables, statements, notes, and report generation.
+- End-to-end PDF pipeline for raw text, tables, statements, notes, facts, and report generation.
+- Canonical financial fact layer with page/table/cell evidence metadata for review and downstream exports.
+- Evaluation runner with machine-readable reports and configurable quality thresholds.
 - Works in mock mode out of the box, with optional OpenAI and Anthropic model routing.
 - Vue 3 dashboard for reviewing original PDFs alongside extraction and analysis outputs.
 - Docker-first local stack with API, worker, Redis, PostgreSQL, and MinIO.
@@ -17,13 +19,13 @@ It is designed for teams that need a single workflow to ingest reports, inspect
 
 ```mermaid
 flowchart LR
-    A[Financial PDF] --> B[PDF extraction and OCR]
-    B --> C[Normalization and validation]
-    C --> D[LLM enrichment and report generation]
-    C --> E[Risk signals and event study]
-    D --> F[FastAPI and CLI]
-    E --> F
-    F --> G[Vue dashboard at /ui]
+  A[Financial Filing PDF] --> B[PDF extraction and OCR]
+  B --> C[Statements and canonical facts]
+  C --> D[Evidence and validation]
+  D --> E[Review, API, and exports]
+  D --> F[Risk signals and analyst reports]
+  E --> G[Vue dashboard at /ui]
+  F --> G
 ```
 
 ## Quick Start
@@ -81,7 +83,7 @@ After startup, the main entry points are:
 | Surface | URL / Command | Notes |
 | --- | --- | --- |
 | Web UI | `http://127.0.0.1:18000/ui/` | Review uploaded PDFs, tables, statements, signals, and generated reports |
-| API | `http://127.0.0.1:18000/v1` | Programmatic ingestion and retrieval |
+| API | `http://127.0.0.1:18000/v1` | Programmatic ingestion and retrieval, including canonical facts |
 | OpenAPI docs | `http://127.0.0.1:18000/docs` | Interactive API explorer |
 | Health | `http://127.0.0.1:18000/health` | Liveness probe |
 | Metrics | `http://127.0.0.1:18000/metrics` | Prometheus endpoint |
@@ -153,6 +155,7 @@ pip install -e ".[all]"
 ```bash
 make test
 make eval
+python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json
 make fmt
 make lint
 make typecheck
@@ -164,12 +167,19 @@ The repository is organized around a small number of clear surfaces:
 
 - `src/api/` for HTTP entry points and application wiring
 - `src/pdf/` for extraction, rendering, tables, and OCR
-- `src/finance/` for schemas, normalization, validation, and signal logic
+- `src/finance/` for facts, normalization, validation, and signal logic
 - `src/agent/` for pipeline orchestration and state handling
 - `src/market/` for event-study analysis and market providers
 - `web/` for the Vue 3 dashboard
 - `tests/` for API, storage, pipeline, frontend-adjacent, and integration coverage
-- `docs/` for architecture, branch protection, and project notes
+- `benchmarks/` for benchmark manifest schemas, threshold configs, and non-sensitive sample manifests
+- `docs/` for architecture, branch protection, roadmap, and project notes
+
+## Benchmark Data Policy
+
+Benchmark manifests, anonymized labels, synthetic fixtures, schemas, and threshold configs can be committed. Raw third-party or proprietary PDFs, private labels, customer files, and generated benchmark artifacts must stay out of git.
+
+Use `benchmarks/raw/` or `benchmarks/private/` for local-only datasets. Those paths are ignored by git. Store only stable metadata, expected facts, expected evidence pointers, and licensing notes in committed manifests.
 
 ## Contributing
 

@@ -0,0 +1,27 @@
+# Benchmark Manifests
+
+This directory stores committed benchmark metadata for Jetbot evaluation. It is for manifests, schemas, anonymized labels, synthetic fixtures, and quality threshold configs only.
+
+Do commit:
+
+- `manifest.schema.json`
+- anonymized benchmark manifests
+- synthetic fixture metadata
+- expected facts, expected evidence pointers, expected note/risk labels
+- threshold configs under `thresholds/`
+
+Do not commit:
+
+- raw third-party or proprietary PDFs
+- private customer reports
+- non-anonymized analyst labels
+- generated eval outputs
+- files under `benchmarks/raw/` or `benchmarks/private/`
+
+Run the current golden evaluation gate with:
+
+```bash
+python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json
+```
+
+Real PDF benchmark manifests should point to local-only files through relative paths such as `raw/company-2025-10k.pdf`. Those raw files are intentionally ignored by git.
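
The manifest schema lets a case's `source` carry an optional `sha256`, which offers one way to confirm that a local-only raw file is the one the committed manifest expects. The following is a minimal sketch under stated assumptions, not Jetbot's implementation: `file_sha256`, `verify_source`, and the `base_dir` parameter are hypothetical names, and resolving `path` relative to `base_dir` is an assumption about how paths are interpreted.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source(source: dict, base_dir: Path) -> str:
    """Classify a manifest `source` entry.

    Returns one of: "missing" (file not present locally), "unchecked"
    (no sha256 recorded in the manifest), "ok", or "mismatch".
    Hypothetical helper: `base_dir` is an assumption about how relative
    paths are resolved; the real eval runner may do this differently.
    """
    path = base_dir / source["path"]
    if not path.is_file():
        return "missing"
    expected = source.get("sha256")
    if not expected:
        return "unchecked"
    return "ok" if file_sha256(path) == expected.lower() else "mismatch"
```

Returning a status string instead of raising keeps a missing raw file a reportable condition, which fits a setup where contributors may not have every private dataset locally.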
@@ -0,0 +1,96 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://github.com/magic-alt/jetbot/benchmarks/manifest.schema.json",
+  "title": "Jetbot Benchmark Manifest",
+  "type": "object",
+  "additionalProperties": false,
+  "required": ["schema_version", "benchmark_id", "name", "cases"],
+  "properties": {
+    "schema_version": {"type": "integer", "const": 1},
+    "benchmark_id": {"type": "string", "minLength": 1},
+    "name": {"type": "string", "minLength": 1},
+    "description": {"type": "string"},
+    "data_policy": {
+      "type": "object",
+      "additionalProperties": false,
+      "required": ["raw_files_committed", "label_policy"],
+      "properties": {
+        "raw_files_committed": {"type": "boolean", "const": false},
+        "label_policy": {"type": "string", "enum": ["synthetic", "anonymized", "private"]},
+        "notes": {"type": "string"}
+      }
+    },
+    "cases": {
+      "type": "array",
+      "minItems": 1,
+      "items": {"$ref": "#/$defs/case"}
+    }
+  },
+  "$defs": {
+    "case": {
+      "type": "object",
+      "additionalProperties": false,
+      "required": ["case_id", "source", "expected_facts"],
+      "properties": {
+        "case_id": {"type": "string", "minLength": 1},
+        "company": {"type": "string"},
+        "ticker": {"type": "string"},
+        "filing_type": {"type": "string"},
+        "period_end": {"type": "string", "format": "date"},
+        "source": {
+          "type": "object",
+          "additionalProperties": false,
+          "required": ["type", "path"],
+          "properties": {
+            "type": {"type": "string", "enum": ["synthetic", "pdf", "html", "xbrl"]},
+            "path": {"type": "string", "minLength": 1},
+            "license": {"type": "string"},
+            "sha256": {"type": "string"}
+          }
+        },
+        "expected_facts": {
+          "type": "array",
+          "items": {"$ref": "#/$defs/fact"}
+        },
+        "expected_notes": {
+          "type": "array",
+          "items": {"type": "string"}
+        },
+        "expected_risk_categories": {
+          "type": "array",
+          "items": {"type": "string"}
+        }
+      }
+    },
+    "fact": {
+      "type": "object",
+      "additionalProperties": false,
+      "required": ["statement_type", "concept", "value"],
+      "properties": {
+        "statement_type": {"type": "string", "enum": ["income", "balance", "cashflow", "note", "other"]},
+        "concept": {"type": "string", "minLength": 1},
+        "label": {"type": "string"},
+        "value": {"type": "number"},
+        "unit": {"type": "string"},
+        "currency": {"type": "string"},
+        "period_end": {"type": "string", "format": "date"},
+        "evidence": {
+          "type": "array",
+          "items": {"$ref": "#/$defs/evidence"}
+        }
+      }
+    },
+    "evidence": {
+      "type": "object",
+      "additionalProperties": false,
+      "required": ["page"],
+      "properties": {
+        "page": {"type": "integer", "minimum": 1},
+        "table_id": {"type": "string"},
+        "row": {"type": "integer", "minimum": 0},
+        "col": {"type": "integer", "minimum": 0},
+        "quote": {"type": "string"}
+      }
+    }
+  }
+}
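
For a quick local sanity check of a manifest, the `required` arrays in the schema above can be mirrored in a few lines of standard-library Python. This is only a sketch: it checks required keys and the non-empty `cases` rule, not types, enums, or `additionalProperties`, so a full JSON Schema validator is still needed for real validation. `REQUIRED`, `missing_keys`, and `check_manifest` are hypothetical names.

```python
from __future__ import annotations

# Required keys copied from the `required` arrays in the schema above.
REQUIRED = {
    "manifest": ("schema_version", "benchmark_id", "name", "cases"),
    "case": ("case_id", "source", "expected_facts"),
    "source": ("type", "path"),
    "fact": ("statement_type", "concept", "value"),
    "evidence": ("page",),
}


def missing_keys(obj: dict, kind: str, where: str) -> list[str]:
    """Report each required key of `kind` that is absent from `obj`."""
    return [f"{where}: missing '{key}'" for key in REQUIRED[kind] if key not in obj]


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means all required keys exist."""
    problems = missing_keys(manifest, "manifest", "manifest")
    if "cases" in manifest and not manifest["cases"]:
        problems.append("manifest: 'cases' must contain at least one case")
    for i, case in enumerate(manifest.get("cases", [])):
        where = f"cases[{i}]"
        problems += missing_keys(case, "case", where)
        if "source" in case:
            problems += missing_keys(case["source"], "source", f"{where}.source")
        for j, fact in enumerate(case.get("expected_facts", [])):
            fact_where = f"{where}.expected_facts[{j}]"
            problems += missing_keys(fact, "fact", fact_where)
            for k, evidence in enumerate(fact.get("evidence", [])):
                problems += missing_keys(
                    evidence, "evidence", f"{fact_where}.evidence[{k}]"
                )
    return problems
```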
@@ -0,0 +1,41 @@
+{
+  "schema_version": 1,
+  "benchmark_id": "synthetic-smoke-v1",
+  "name": "Synthetic Smoke Benchmark",
+  "description": "A committed example manifest showing the expected shape for benchmark metadata. It does not reference real proprietary files.",
+  "data_policy": {
+    "raw_files_committed": false,
+    "label_policy": "synthetic",
+    "notes": "Use synthetic or anonymized labels in git. Keep real PDFs under ignored local paths."
+  },
+  "cases": [
+    {
+      "case_id": "synthetic-income-001",
+      "company": "Example Co",
+      "ticker": "EXM",
+      "filing_type": "10-Q",
+      "period_end": "2025-12-31",
+      "source": {
+        "type": "synthetic",
+        "path": "tests/golden/conftest.py",
+        "license": "synthetic"
+      },
+      "expected_facts": [
+        {
+          "statement_type": "income",
+          "concept": "revenue",
+          "label": "Revenue",
+          "value": 100.0,
+          "unit": "USD millions",
+          "currency": "USD",
+          "period_end": "2025-12-31",
+          "evidence": [
+            {"page": 1, "quote": "Revenue 100"}
+          ]
+        }
+      ],
+      "expected_notes": ["other"],
+      "expected_risk_categories": []
+    }
+  ]
+}
@@ -0,0 +1,12 @@
+{
+  "schema_version": 1,
+  "description": "Initial non-regression thresholds for the synthetic golden suite. Tighten these as extraction quality improves.",
+  "min_metrics": {
+    "n_cases": 5,
+    "avg_source_ref_completeness": 1.0,
+    "avg_signal_category_recall": 0.8,
+    "avg_note_type_recall": 0.6,
+    "avg_fact_value_accuracy": 0.08,
+    "avg_fact_source_ref_completeness": 0.34
+  }
+}
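
How `scripts/eval.py` consumes this file is not shown in the diff; the roadmap only states that the command exits non-zero when a metric falls below its threshold. A minimal sketch of that kind of gate follows, assuming the aggregate metrics arrive as a flat dict keyed by the same names as `min_metrics`. `failed_thresholds` and `gate` are hypothetical names, and treating a missing metric as a failure is a design choice of this sketch.

```python
from __future__ import annotations

import json
import sys


def failed_thresholds(metrics: dict, min_metrics: dict) -> list[str]:
    """List every metric that is missing or below its configured minimum."""
    failures = []
    for name, minimum in min_metrics.items():
        actual = metrics.get(name)
        if actual is None:
            failures.append(f"{name}: missing (minimum {minimum})")
        elif actual < minimum:
            failures.append(f"{name}: {actual} < {minimum}")
    return failures


def gate(metrics: dict, thresholds_json: str) -> int:
    """Return a process exit code: 0 if all minimums are met, 1 otherwise."""
    min_metrics = json.loads(thresholds_json)["min_metrics"]
    failures = failed_thresholds(metrics, min_metrics)
    for line in failures:
        print(f"FAIL {line}", file=sys.stderr)
    return 1 if failures else 0
```

A caller would pass the result to `sys.exit(...)` so that CI fails the job whenever any minimum is not met.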
@@ -79,7 +79,7 @@ Jetbot already has a fairly complete financial-report PDF Agent MVP capability set: PDF upload,
 
 ## 4. Completed First Implementation Slice
 
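-The first slice of this roadmap has been implemented on the current branch `feat/financial-fact-foundation`. Its goal is to lay the fact-layer foundation for later human review, export, and benchmarking.
+The first slice of this roadmap has been merged into `main` via PR12. Its goal is to lay the fact-layer foundation for later human review, export, and benchmarking.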
 
 ### 4.1 Schema and Evidence Model
 
@@ -162,7 +162,7 @@ Jetbot already has a fairly complete financial-report PDF Agent MVP capability set: PDF upload,
 
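 - The docs and README state Jetbot's next-stage positioning explicitly.
 - Every P0 feature maps to a quality metric.
-- Do not commit real sensitive PDFs to the repository.
+- Do not commit real sensitive PDFs to the repository; real samples stay in local or private storage only, and the repository holds only manifests, anonymized labels, synthetic fixtures, schemas, and threshold configs.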
 
 ### Phase 1: Benchmark and Eval CI, Weeks 1-2
 
@@ -192,6 +192,7 @@ Jetbot already has a fairly complete financial-report PDF Agent MVP capability set: PDF upload,
 Acceptance criteria:
 
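 - `python scripts/eval.py --output-dir data/eval-dev` generates a report.
+- `python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json` can serve as a quality gate and returns a non-zero exit code when a metric falls below its threshold.
 - The report includes document-level and aggregate metrics.
 - The synthetic golden gate runs reliably in CI.
 - The real PDF benchmark can run locally without committing sensitive samples to git.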
@@ -574,13 +575,14 @@ docker compose up --build
 
 ## 10. Recommended Next Execution Order
 
-1. Finish the correction API and effective facts.
-2. Add a facts tab or review panel to the frontend.
-3. Add a bbox overlay to `PdfViewer`.
-4. Add a row/col/bbox payload to `EvidenceLink`.
-5. Add Excel/CSV/JSON export.
-6. Extend the benchmark manifest and threshold gate.
-7. Start the table router protocol.
-8. Then add SEC/XBRL/HTML ingestion.
+1. Close out Phase 0: the README and roadmap formally position Jetbot as a Filing-to-Model Copilot / Financial Fact Platform and document the benchmark data policy.
+2. Finish the Phase 1 evaluation gate: benchmark manifest schema, sample manifest, threshold config, and eval gate.
+3. Finish the correction API and effective facts.
+4. Add a facts tab or review panel to the frontend.
+5. Add a bbox overlay to `PdfViewer`.
+6. Add a row/col/bbox payload to `EvidenceLink`.
+7. Add Excel/CSV/JSON export.
+8. Start the table router protocol.
+9. Then add SEC/XBRL/HTML ingestion.
 
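 The test for this roadmap is simple: every added capability must make facts more accurate, evidence more auditable, review less time-consuming, and output better suited to a real analyst workflow.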