**57 commits**, all by wangxingjun778:

- `3eb5354` bump version (Apr 13, 2026)
- `b72a878` Introduce Sirchmunk Learnings (insights from pageindex and LLM wiki) (Apr 13, 2026)
- `c4f4b16` improve compile infer (Apr 13, 2026)
- `6458477` improve search pipeline for compile mode (Apr 13, 2026)
- `1f6f799` fix and enhance llm wiki and tree index for FAST search (Apr 14, 2026)
- `077be35` fix `_extract_catalog_keywords` for llm wiki (Apr 14, 2026)
- `a602197` add tree guided sampling (Apr 14, 2026)
- `8233c35` fix compile quality and large-file processing (Apr 14, 2026)
- `1de1c98` adopt the latest compile processing (Apr 14, 2026)
- `938ced1` refactor tree indexing with toc (Apr 15, 2026)
- `7b65b4b` ok Merge branch 'main' of github.com:modelscope/sirchmunk into feat/s… (Apr 15, 2026)
- `29c0909` enhance compile for excel and add embedding fallback for rga keywords… (Apr 15, 2026)
- `d1f1fd4` fix storage (Apr 15, 2026)
- `caf8e05` add financebench (Apr 16, 2026)
- `4a0a017` add llm judge for financebench (Apr 16, 2026)
- `613c099` Adapt older knowledge cluster data structure (Apr 16, 2026)
- `6858418` update finance bench readme (Apr 16, 2026)
- `9441ef2` refactor config for finbench (Apr 16, 2026)
- `f1f86fa` refactor financebench readme (Apr 17, 2026)
- `0e46ef5` update readme for finbench (Apr 17, 2026)
- `2cf5c37` enhance tree indexes usage for search pipeline (Apr 17, 2026)
- `c0b0db5` fix issues (Apr 17, 2026)
- `e8184d0` update tree index (Apr 17, 2026)
- `dc27ed9` update finbench readme (Apr 17, 2026)
- `8723b85` update finbench readme (Apr 17, 2026)
- `ca9a609` update should answer thres (Apr 17, 2026)
- `34c181e` fix eval for finbench in runner (Apr 17, 2026)
- `2b4714e` refactor metrics as LLM judge for finbench (Apr 17, 2026)
- `a184e86` update config (Apr 17, 2026)
- `eb43fdd` refactor doc extractor (Apr 17, 2026)
- `b2c26bb` enhance compiler for tree indexing (Apr 19, 2026)
- `d4e8fe3` fix table extraction (Apr 19, 2026)
- `1d550bf` fix warning (Apr 19, 2026)
- `384d345` enhance compiler (Apr 19, 2026)
- `579f8d6` fix robust issue (Apr 19, 2026)
- `78c1117` fix pure tree search env (Apr 20, 2026)
- `86d528e` improve tree index (Apr 20, 2026)
- `2f3a257` improve search tree index (Apr 26, 2026)
- `9dd47be` update log (Apr 26, 2026)
- `8ff1f98` enhance search fast for compile (Apr 26, 2026)
- `464d8d5` enhance tree index (Apr 26, 2026)
- `bdd8bdc` fix review (Apr 26, 2026)
- `d3b91d6` improve kreuzberg table extraction (Apr 26, 2026)
- `63ed047` enhance compiler table extraction (Apr 26, 2026)
- `b760119` fix table extraction (Apr 27, 2026)
- `e55ada7` improve compile for summary and table (May 9, 2026)
- `fe351a1` fix tree index (May 9, 2026)
- `cb1ba96` update compiler (May 11, 2026)
- `93d4a1f` improve compile efficiency (May 11, 2026)
- `929fbc5` improve compile mem usage (May 11, 2026)
- `207fe59` improve extractor multi-processing (May 11, 2026)
- `bbc2bbd` fix ProcessPoolExecutor (May 11, 2026)
- `5af51df` clean methods for compiler (May 11, 2026)
- `af5f7e1` improve all corpus (May 12, 2026)
- `7439521` tree index and rga fusion (May 12, 2026)
- `cec209d` fallback hybrid tree indexing (May 12, 2026)
- `59beaea` improve search pipeline for hybrid (May 12, 2026)
**`.gitignore`** (2 additions, 1 deletion):

```diff
@@ -270,4 +270,5 @@ benchmarks/*/data/
 benchmarks/*/.env*
 benchmarks/*/logs/
 benchmarks/*/results/
-benchmarks/*/output/
+benchmarks/*/output/
+benchmarks/*/.work/
```
**New file: `benchmarks/financebench/README.md`** (268 additions):
# FinanceBench Benchmark

FinanceBench evaluation pipeline for **Sirchmunk AgenticSearch**.

## Overview

[FinanceBench](https://arxiv.org/abs/2311.11944) is an open-book financial QA benchmark
with **150 expert-annotated questions** across **40+ US public companies** (10-K/10-Q filings).

### Evaluation Modes

| Mode | Description |
|------|-------------|
| `singleDoc` | Each question searches only its target PDF (standard) |
| `sharedCorpus` | All questions search the full 41-PDF corpus |

### Metrics

- **3-Class Scoring**: Correct / Hallucination / Refusal (per FinanceBench paper)
- **EM / F1**: Exact Match and token-level F1 with financial value normalization
- **Evidence Recall**: Retrieved pages vs gold evidence pages
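The actual scoring lives in `evaluate.py`; as a rough, illustrative sketch of what "financial value normalization plus token-level F1" means (function names here are hypothetical, not the benchmark's API):

```python
import re
import string
from collections import Counter

def normalize_financial(text: str) -> str:
    """Lowercase, strip currency symbols/commas, and drop zero cents
    so that "$1,577.00" and "1577" compare equal. (Illustrative only.)"""
    text = text.lower().strip()
    text = text.replace("$", "").replace(",", "")
    # "1577.00" -> "1577", but "1577.50" is left alone
    text = re.sub(r"(\d+)\.0+\b", r"\1", text)
    # Remove punctuation except the decimal point
    text = text.translate(str.maketrans("", "", string.punctuation.replace(".", "")))
    return " ".join(text.split())

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    p = normalize_financial(pred).split()
    g = normalize_financial(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

With rules like these, a verbose answer such as "capex was $1,577.00" still earns partial F1 credit against the gold "1577".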

## Prerequisites

### Step 1: Install Sirchmunk

Install Sirchmunk from the repository root so that the `sirchmunk` CLI is available:

```bash
# From repository root
pip install -e .
```

Verify the installation:

```bash
sirchmunk --version
```

### Step 2: Prepare Dataset

Download the [FinanceBench](https://huggingface.co/datasets/PatronusAI/financebench)
dataset and place the files under `benchmarks/financebench/data/`:

```
data/
β”œβ”€β”€ financebench_open_source.jsonl   # 150 expert-annotated QA pairs
└── pdfs/                            # 41 SEC-filing PDFs (10-K / 10-Q)
    β”œβ”€β”€ 3M_2018_10K.pdf
    β”œβ”€β”€ AMCOR_2023_10K.pdf
    └── ...
```

Each PDF filename must match the `doc_name` field in the JSONL file.
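A quick way to verify the filename-to-`doc_name` mapping before compiling (a small standalone sketch, not part of the benchmark code):

```python
import json
from pathlib import Path

def check_corpus(jsonl_path: str, pdf_dir: str) -> list[str]:
    """Return doc_names referenced in the JSONL with no matching PDF."""
    needed = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                needed.add(json.loads(line)["doc_name"])
    have = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    return sorted(needed - have)

# Example: print any questions whose target PDF is missing
# missing = check_corpus("data/financebench_open_source.jsonl", "data/pdfs")
```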

### Step 3: Initialize Experiment Workspace

Initialize an isolated workspace for this experiment. This keeps the knowledge base
and cache separate from the default `~/.sirchmunk`:

```bash
cd benchmarks/financebench
sirchmunk init --work-path .work
```

This creates a `.work/` directory containing a **platform .env** file (`.work/.env`).

**Configure the platform .env** (`.work/.env`):

This file controls the LLM provider used by Sirchmunk's search engine.
You **must** set valid LLM credentials here before proceeding.

| Variable | Required | Description | Example |
|----------|----------|-------------|-----------------------------------------------------|
| `LLM_API_KEY` | **Yes** | API key for the LLM provider | `sk-xxx` |
| `LLM_BASE_URL` | **Yes** | LLM API endpoint | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
| `LLM_MODEL_NAME` | **Yes** | Model name for search & QA | `qwen3.6-plus` |
| `LLM_TIMEOUT` | No | Request timeout in seconds | `120` |

```bash
# Edit the platform .env
vi .work/.env
```
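A filled-in `.work/.env` might look like the following (all values are placeholders; use your own provider's credentials):

```bash
# .work/.env -- platform LLM settings (placeholder values)
LLM_API_KEY=sk-your-key-here
LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
LLM_MODEL_NAME=qwen3.6-plus
LLM_TIMEOUT=120
```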

### Step 4: Knowledge Compiling

Compile the PDF corpus into the experiment workspace so that Sirchmunk can search it:

```bash
sirchmunk compile --work-path .work --paths data/pdfs
```

> **Note:** This step parses, chunks, and indexes every PDF.
> For the full 41-PDF FinanceBench corpus, expect several hours of processing time, depending on your LLM speed and compute resources.

#### Shallow Compile (Recommended for First Run)

Use `--shallow` to skip tree indexing and only generate Summary + Topics.
This reduces LLM calls dramatically and achieves **5–9Γ— speedup**:

```bash
sirchmunk compile --work-path .work --paths data/pdfs --shallow
```

> **Tip:** `--shallow` is ideal for quickly compiling a large corpus on the first pass.
> You can run a normal (full) compile later to incrementally add tree indexes.

### Step 5: Configure Experiment

Create the **experiment .env** from the template:

```bash
cp .env.example .env.financebench
```

**Configure the experiment .env** (`.env.financebench`):

This file controls FinanceBench-specific evaluation parameters.

#### Dataset Paths

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|
| `FB_WORK_PATH` | No | Isolated workspace path | `./.work` |
| `FB_DATA_DIR` | **Yes** | Directory containing `financebench_open_source.jsonl` | `./data` |
| `FB_PDF_DIR` | **Yes** | Directory containing the 41 PDF files | `./data/pdfs` |
| `FB_OUTPUT_DIR` | No | Results output directory | `./output` |

#### Dataset Settings

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|
| `FB_LIMIT` | No | Number of questions to evaluate (`0` = all 150) | `0` |
| `FB_SEED` | No | Random seed for reproducibility | `42` |

#### Search Settings

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|
| `FB_MODE` | No | Search mode: `FAST` or `DEEP` | `FAST` |
| `FB_TOP_K_FILES` | No | Max files returned per search | `5` |
| `FB_MAX_TOKEN_BUDGET` | No | Token budget for search context | `128000` |
| `FB_ENABLE_DIR_SCAN` | No | Enable directory-level scanning | `true` |

#### Evaluation Settings

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|
| `FB_EVAL_MODE` | No | `singleDoc` (per-PDF) or `sharedCorpus` (all PDFs) | `singleDoc` |
| `FB_ENABLE_LLM_JUDGE` | No | Enable LLM Judge for semantic equivalence | `true` |
| `FB_EXTRACT_ANSWER` | No | Extract short answer from verbose response | `true` |

#### Concurrency Settings

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|
| `FB_MAX_CONCURRENT` | No | Max concurrent evaluation requests | `3` |
| `FB_REQUEST_DELAY` | No | Delay between requests in seconds | `0.5` |
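The actual batch runner lives in `runner.py`; a minimal sketch of how these two knobs typically interact in an async runner (the `ask` coroutine and function names are hypothetical):

```python
import asyncio

async def run_all(questions, ask, max_concurrent=3, request_delay=0.5):
    """Bound in-flight requests with a semaphore (FB_MAX_CONCURRENT)
    and pause between submissions (FB_REQUEST_DELAY).
    `ask` is any coroutine taking a question and returning an answer."""
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(q):
        async with sem:          # at most max_concurrent run at once
            return await ask(q)

    tasks = []
    for q in questions:
        tasks.append(asyncio.create_task(worker(q)))
        await asyncio.sleep(request_delay)  # throttle submission rate
    return await asyncio.gather(*tasks)     # results in question order
```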

**Optional LLM Override**: If you want this experiment to use a **different** LLM
than the platform config, uncomment the `LLM_*` lines in `.env.financebench`.
Otherwise, the experiment inherits LLM settings from `.work/.env`.

```bash
# Edit the experiment .env
vi .env.financebench
```

## Configuration Architecture

Configuration is resolved through layered inheritance (highest priority wins):

```
Priority (highest β†’ lowest):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Command-line args                  β”‚ ← --limit N, --env <file>
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ .env.financebench (experiment)     β”‚ ← FB_* params + optional LLM override
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ .work/.env (platform)              β”‚ ← LLM_API_KEY, LLM_MODEL_NAME, etc.
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Environment variables              β”‚ ← os.environ fallback
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Defaults                           β”‚ ← Hard-coded in FinanceBenchConfig
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
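The real loader lives in `config.py` and may differ in detail; a stdlib-only sketch of this layering (no `python-dotenv`; helper names are hypothetical):

```python
import os

def parse_env_file(path):
    """Parse KEY=VALUE lines from a .env file; a missing file yields {}."""
    values = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, val = line.partition("=")
                    values[key.strip()] = val.strip()
    except FileNotFoundError:
        pass
    return values

def load_setting(key, cli_args=None, default=None):
    """Resolve one setting, consulting layers from highest to lowest priority."""
    layers = [
        (cli_args or {}),                     # 1. command-line args
        parse_env_file(".env.financebench"),  # 2. experiment .env
        parse_env_file(".work/.env"),         # 3. platform .env
        os.environ,                           # 4. process environment
    ]
    for layer in layers:
        if key in layer:
            return layer[key]
    return default                            # 5. hard-coded default
```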

### What Goes Where?

| Setting | Platform `.work/.env` | Experiment `.env.financebench` |
|---------|:---------------------:|:------------------------------:|
| LLM API Key | βœ… (required) | Only if overriding |
| LLM Model | βœ… (required) | Only if overriding |
| LLM Base URL | βœ… (required) | Only if overriding |
| LLM Timeout | Optional | Only if overriding |
| PDF directory | β€” | βœ… (required) |
| Data directory | β€” | βœ… (required) |
| Output directory | β€” | Optional |
| Eval mode | β€” | Optional |
| Search mode | β€” | Optional |
| LLM Judge | β€” | Optional |
| Concurrency | β€” | Optional |

## Usage

### 1. Run

```bash
# Run full benchmark (150 questions)
python run_benchmark.py

# Run with custom config and question limit
python run_benchmark.py --env .env.custom --limit 20
```

### 2. Analyze

```bash
# Analyze a completed run
python analyze_results.py output/results_YYYYMMDD_HHMMSS.jsonl

# Show more error cases
python analyze_results.py output/results_*.jsonl --max-errors 50
```

## Data Format

The dataset file `financebench_open_source.jsonl` contains one JSON object per line:

```json
{
  "financebench_id": "financebench_id_00001",
  "question": "What is the FY2018 capital expenditure amount for 3M?",
  "answer": "$1,577.00",
  "doc_name": "3M_2018_10K",
  "company": "3M",
  "question_type": "fact-based-w-numerical-answer",
  "question_reasoning": "retrieve",
  "evidence": [{"evidence_text": "...", "evidence_page_num": 42}]
}
```
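Loading this file is a one-liner per record; a sketch that also mirrors the `FB_LIMIT` / `FB_SEED` sampling behavior described above (illustrative only, not the `data_loader.py` API):

```python
import json
import random

def load_questions(jsonl_path, limit=0, seed=42):
    """Load FinanceBench records; optionally take a reproducible
    random sample of `limit` questions (0 means all)."""
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if limit and limit < len(records):
        records = random.Random(seed).sample(records, limit)
    return records
```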

## File Structure

```
benchmarks/financebench/
β”œβ”€β”€ .env.example         # Config template (copy to .env.financebench)
β”œβ”€β”€ config.py            # FinanceBenchConfig dataclass
β”œβ”€β”€ data_loader.py       # Dataset + PDF corpus loader
β”œβ”€β”€ evaluate.py          # EM/F1/3-class scoring + aggregation
β”œβ”€β”€ runner.py            # Async batch runner (AgenticSearch)
β”œβ”€β”€ run_benchmark.py     # CLI entry point
β”œβ”€β”€ analyze_results.py   # Post-hoc analysis tool
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ financebench_open_source.jsonl
β”‚   └── pdfs/            # 41 SEC-filing PDFs
β”œβ”€β”€ output/              # Results + metrics (auto-created)
└── logs/                # Run logs (auto-created)
```

## SOTA Reference

| System | Accuracy | Coverage |
|--------|----------|----------|
| Mafin 2.5 (SOTA) | 98.7% | 100% |
| Fintool | 98.0% | 66.7% |
| Quantly | 94.0% | 100% |
| GPT-4 (zero-shot) | 29.3% | 100% |

> Mafin 2.5 uses a PageIndex + Agentic Vectorless RAG 3.0 architecture.