37 commits
ceb9333
Add complete benchmarking system for information retrieval evaluation
suung Oct 11, 2025
659be25
Update pulsar-client version to 3.6.1
suung Dec 18, 2025
88415b8
Add flexible benchmark dataset support and ClimRetrieve integration
suung Dec 18, 2025
d08a085
Add flexible benchmark dataset support and ClimRetrieve integration
suung Dec 18, 2025
7387aa2
Merge main into feature/benchmarking and migrate benchmark code
suung Jan 14, 2026
75754b8
Format benchmark code and add merge assessment
suung Jan 14, 2026
e3326ea
Add .devnotes/ to .gitignore for development assessments
suung Jan 14, 2026
c1ec086
Exclude report_analyst_enterprise from black and isort linting
suung Jan 15, 2026
74e872d
Format test_linting.py with black
suung Jan 15, 2026
a592cb9
Remove emojis from benchmarking feature
suung Jan 24, 2026
a5512e5
Bugs identified, chunk dataset created, cliretrieve yaml created
YauheniHuryn Feb 17, 2026
a9b31cf
Benchmarking tab added
YauheniHuryn Feb 18, 2026
c2882e9
added analysis against climretrieve question set, removed unnecessary…
YauheniHuryn Feb 19, 2026
59f8783
download error analysis dataset implementation, metrics evaluation bu…
YauheniHuryn Feb 20, 2026
a89bacd
creating a possibility for report analyst to be a package
YauheniHuryn Feb 22, 2026
69b00da
fixing toml file
YauheniHuryn Feb 23, 2026
8d3fa23
fixing toml file 2
YauheniHuryn Feb 23, 2026
1cc5a98
fixing toml file 3
YauheniHuryn Feb 23, 2026
16f27f6
fixing toml file 4
YauheniHuryn Feb 24, 2026
712a109
Remove accidental frontend artifacts from benchmark branch
suung Feb 24, 2026
9f69773
Ignore coverage.xml artifact
suung Feb 24, 2026
a7d7d20
Format files for latest Black CI
suung Feb 24, 2026
91712b4
implementing alignment option for the datasets of various structures
YauheniHuryn Feb 26, 2026
89373cb
fixing missing climretrieve test
YauheniHuryn Feb 26, 2026
7782bb1
fixing black test
YauheniHuryn Feb 27, 2026
9447fab
fixing flake8 bug
YauheniHuryn Feb 27, 2026
b925c1e
fixing warning issue
YauheniHuryn Feb 27, 2026
d0d2c49
fixing warning issue
YauheniHuryn Feb 27, 2026
73b3a6b
Merge branch 'feature/benchmarking' into feature/benchmarking-bugfixes
suung Feb 27, 2026
c1ec140
untracking data folder, modifying COLAB.md
YauheniHuryn Feb 28, 2026
477ddf5
Fix CI: benchmark_store fixture, black format, pytest-asyncio
suung Feb 28, 2026
08931bb
edited readme files, added a feature to download aligned datasets fro…
YauheniHuryn Mar 1, 2026
18d0d0b
Merge remote-tracking branch 'origin/feature/benchmarking-bugfixes' i…
YauheniHuryn Mar 1, 2026
6aeb85d
Resolve conflicts after syncing with remote
YauheniHuryn Mar 1, 2026
34b1bfa
deleting merge conflict marker
YauheniHuryn Mar 1, 2026
5b3abd0
Merge pull request #24 from climateandtech/feature/benchmarking-bugfixes
suung Mar 1, 2026
57c24e4
Potential fix for pull request finding 'Unused import'
suung Mar 1, 2026
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -113,7 +113,7 @@ jobs:
- name: Install linting dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 black isort mypy
pip install -c requirements.txt black flake8 isort mypy

- name: Run black (code formatting check)
run: |
14 changes: 13 additions & 1 deletion .gitignore
@@ -1,7 +1,8 @@
/data
report_analyst/data
.env
*.log
venv/
venv2/

.gradio
__pycache__/
@@ -21,6 +22,7 @@ report_analyst/temp/*

# Coverage and test artifacts
.coverage
coverage.xml
.pytest_cache/

# Temporary files and diffs
@@ -39,6 +41,9 @@ inputlucia
# macOS system files
.DS_Store

# Development notes and assessments (keep local, don't commit)
.devnotes/

# Deployment guides and assessments (keep local, don't commit)
CLOUDFLARE_ASSESSMENT.md
CLOUDFLARE_QUICK_SUMMARY.md
@@ -50,3 +55,10 @@ THEME_SWITCHING.md
VERCEL_CHANGES_SUMMARY.md
VERCEL_DEPLOYMENT_ASSESSMENT.md
VERCEL_MIGRATION_GUIDE.md
.env
reports/

# Frontend generated artifacts
report_analyst_enterprise/components/web/node_modules/
report_analyst_enterprise/components/web/dist/
report_analyst_enterprise/components/streamlit_component/frontend/build/
281 changes: 281 additions & 0 deletions COLAB.md
@@ -0,0 +1,281 @@
# Using report-analyst in Google Colab

This guide explains how to use the **benchmarking and diagnostics** parts of
`report-analyst` from Google Colab (or any notebook environment).

You can use it to:

- **Evaluate retrieval systems from CSV files** (Precision@K, Recall@K, F1@K,
NDCG@K, MAP, MRR).
- **Generate detailed error-analysis tables** for retrieved chunks.
- **Align heterogeneous benchmark CSVs** (e.g. ClimRetrieve) to a common
internal schema using YAML configs.
- **Run PDF chunk subset diagnostics** to understand how different chunk sizes
relate to each other.

The examples below focus on **offline evaluation and diagnostics**. They do not
run the Streamlit UI or the full backend.

## 1. Installation in Colab

Install directly from GitHub. For a clean install (e.g. after changing branches), you can uninstall and clear the pip cache first:

```python
!pip uninstall report-analyst -y
!pip cache purge
```

Then install from your repo and branch (replace `<org>`, `<repo>`, and `<branch>` with your values):

```python
# Public repo – no token needed
!pip install --no-cache-dir "git+https://github.com/<org>/<repo>.git@<branch>#subdirectory=report-analyst"
```

If the repo is **private**, use a GitHub personal access token. Store it in Colab secrets or as an environment variable (e.g. `GITHUB_TOKEN`) and do **not** commit it:

```python
import os
token = os.environ.get("GITHUB_TOKEN", "") # or from Colab secrets
if token:
    !pip install --no-cache-dir "git+https://{token}@github.com/<org>/<repo>.git@<branch>#subdirectory=report-analyst"
else:
    !pip install --no-cache-dir "git+https://github.com/<org>/<repo>.git@<branch>#subdirectory=report-analyst"
```
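
Reading the token can be wrapped in a small helper. The sketch below is one possible approach and is not part of `report-analyst`: the function name `get_github_token` is hypothetical, and it assumes the secret is stored under the name `GITHUB_TOKEN`. It tries Colab's `google.colab.userdata` first and falls back to the environment variable when run outside Colab or when the secret is unavailable.

```python
import os


def get_github_token(name: str = "GITHUB_TOKEN") -> str:
    """Return the token from Colab secrets if possible, else from the environment.

    Hypothetical helper for illustration; returns "" when no token is found.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name, "")
    try:
        return userdata.get(name) or ""
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(name, "")


token = get_github_token()
print("token found" if token else "no token set; only public repos will work")
```

Printing only whether a token exists, rather than the token itself, keeps it out of the notebook output.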

If the package lives at the repo root (no `report-analyst` subdirectory), omit `#subdirectory=report-analyst`.

Optional boilerplate imports:

```python
import pandas as pd
from pathlib import Path

from report_analyst.core.benchmark.evaluation_engine import EvaluationEngine
from report_analyst.core.benchmark.retrieval_results_loader import load_flexible_dataset_from_csv
from report_analyst.core.benchmark import error_analysis
from report_analyst.core.benchmark.dataset_mapper import (
    DatasetMapperFactory,
    list_available_dataset_ids,
)
```

If your CSVs are on Google Drive, mount it first:

```python
from google.colab import drive
drive.mount("/content/drive")
```

### Loading datasets from a Git repository

You can load ground truth and benchmark CSVs from raw GitHub URLs. Use **raw** URLs (e.g. `raw.githubusercontent.com`), not blob/view URLs, so you get CSV content, not HTML.
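
If you only have a blob/view URL copied from the browser, it can be rewritten into the raw form. The helper below is a minimal sketch, not part of `report-analyst`: the name `to_raw_github_url` is hypothetical, the `acme/demo` URL is a made-up placeholder, and the function assumes the common `https://github.com/<org>/<repo>/blob/<ref>/<path>` layout. Anything it does not recognise is returned unchanged.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url: str) -> str:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com equivalent."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        return url  # already raw, or not a GitHub URL
    segments = parts.path.strip("/").split("/")
    # Expected layout: <org>/<repo>/blob/<ref>/<path...>
    if len(segments) < 5 or segments[2] != "blob":
        return url
    org, repo = segments[0], segments[1]
    ref_and_path = "/".join(segments[3:])
    return f"https://raw.githubusercontent.com/{org}/{repo}/{ref_and_path}"


# Placeholder URL for illustration only.
print(to_raw_github_url("https://github.com/acme/demo/blob/main/data/gt.csv"))
```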

**Public repo:**

```python
import requests

GROUND_TRUTH_URL = "https://raw.githubusercontent.com/<org>/<repo>/<branch>/path/to/your_ground_truth.csv"
BENCHMARK_URL = "https://raw.githubusercontent.com/<org>/<repo>/<branch>/path/to/your_benchmark.csv"

ground_truth_csv = requests.get(GROUND_TRUTH_URL).text
benchmark_csv = requests.get(BENCHMARK_URL).text
```

**Private repo:** pass your token in the `Authorization` header (read it from Colab secrets or an environment variable; never hardcode it):

```python
import os
import requests

token = os.environ.get("GITHUB_TOKEN", "") # set in Colab environment or secrets
headers = {"Authorization": f"token {token}"} if token else {}

ground_truth_csv = requests.get(GROUND_TRUTH_URL, headers=headers).text
benchmark_csv = requests.get(BENCHMARK_URL, headers=headers).text
```

Then pass these **strings** to `load_flexible_dataset_from_csv` using `csv_content=`, as in section 2.
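
Before handing a downloaded string to the loader, it can be worth checking that it really is CSV and not an HTML page or an error message. The check below is a rough sketch using only the standard library; `csv_header_or_raise` is a hypothetical helper, and the two-column minimum is an arbitrary assumption rather than a requirement of `report-analyst`.

```python
import csv
import io


def csv_header_or_raise(text: str, min_columns: int = 2) -> list[str]:
    """Return the header row, or raise ValueError if the text does not look like CSV."""
    stripped = text.lstrip("\ufeff \t\r\n")  # drop a BOM and leading whitespace
    if stripped.startswith("<"):
        raise ValueError("Got HTML/XML instead of CSV; check that the URL is a raw URL.")
    header = next(csv.reader(io.StringIO(stripped)), None)
    if header is None or len(header) < min_columns:
        raise ValueError(f"Expected at least {min_columns} columns, got header: {header!r}")
    return header


# Toy string for illustration only.
print(csv_header_or_raise("question_id,chunk_id\nq1,c1\n"))
```

A one-line body such as `404: Not Found` parses as a single column, so it is rejected by the column-count check.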

**Alternative – clone the repo:** if you prefer a local path, clone the repo (with token in the URL for private repos) and pass file paths to the loader instead of `csv_content`.

## 2. Core workflow: evaluate retrieval from CSVs

Load ground truth and benchmark from CSV **content** (strings from URLs or file reads), then evaluate with the flexible-dataset API. Use `compare_flexible_datasets`, not `compare_datasets`.

```python
from pathlib import Path
import pandas as pd

from report_analyst.core.benchmark.evaluation_engine import EvaluationEngine
from report_analyst.core.benchmark.retrieval_results_loader import load_flexible_dataset_from_csv

# If you loaded CSV strings from URLs (section 1), use csv_content=
ground_truth_ds = load_flexible_dataset_from_csv(
    csv_content=ground_truth_csv,
    dataset_name="ground_truth",
)
benchmark_ds = load_flexible_dataset_from_csv(
    csv_content=benchmark_csv,
    dataset_name="benchmark",
)

# If you have local paths instead:
# ground_truth_ds = load_flexible_dataset_from_csv(csv_path="ground_truth_aligned.csv", dataset_name="ground_truth")
# benchmark_ds = load_flexible_dataset_from_csv(csv_path="benchmark_results_aligned.csv", dataset_name="benchmark")

engine = EvaluationEngine()
k_values = [1, 3, 5, 10]
metrics = engine.compare_flexible_datasets(ground_truth_ds, benchmark_ds, k_values=k_values)

# Build a metrics table (pivot)
rows = []
for k, v in metrics.precision_at_k.items():
    rows.append({"k": k, "metric": "precision", "value": v})
for k, v in metrics.recall_at_k.items():
    rows.append({"k": k, "metric": "recall", "value": v})
for k, v in metrics.f1_at_k.items():
    rows.append({"k": k, "metric": "f1", "value": v})
for k, v in metrics.ndcg_at_k.items():
    rows.append({"k": k, "metric": "ndcg", "value": v})
rows.append({"k": None, "metric": "MAP", "value": metrics.mean_average_precision})
rows.append({"k": None, "metric": "MRR", "value": metrics.mean_reciprocal_rank})

metrics_df = pd.DataFrame(rows)
metrics_df.pivot(index="metric", columns="k", values="value").round(4)
```
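
For intuition about what these numbers mean, the sketch below computes the same families of metrics for a single query with binary relevance. It uses textbook definitions and is not the `EvaluationEngine` implementation, whose conventions (for example the Precision@K denominator or graded relevance in NDCG) may differ. The chunk IDs are toy values, and the ranking is assumed to contain no duplicates. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k retrieved items that are relevant (denominator is k)."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k if k > 0 else 0.0


def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: 2 of the top 3 retrieved chunks are relevant, 3 relevant chunks exist.
ranked = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3", "c5"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3), average_precision(ranked, relevant))
```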

## 3. Error analysis in notebooks

You can build a detailed, per-chunk error-analysis table similar to what the
Streamlit app exports, but directly inside a notebook.

Use the flexible version of the helper so you can pass the same
`BenchmarkDataset` objects used above:

```python
from report_analyst.core.benchmark import error_analysis

top_k = 10
df_error = error_analysis.build_error_analysis_dataframe_from_flexible(
    ground_truth_dataset=ground_truth_ds,
    benchmark_dataset=benchmark_ds,
    top_k=top_k,
)

# Inspect the first rows
display(df_error.head())

# Example: show only non-relevant retrieved chunks for manual inspection
df_false_positives = df_error[df_error["is_really_relevant"] == False]
display(df_false_positives.head())
```

Columns in `df_error` include (depending on your data):

- `report_name`, `question_id`, `question`
- `relevant_part_text` (from ground truth)
- `retrieved_chunk_text` (from benchmark)
- `position_in_top_k`, `model_score`
- `expert_relevance_label`, `is_really_relevant`
- `chunk_id`, `query_id` (identifiers)
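
One simple way to summarise such a table is the share of retrieved chunks per question that were judged relevant. The sketch below works on plain dictionaries (for example from `df_error.to_dict("records")`) so that it runs without the library. The helper name `relevant_share_by_question` is hypothetical, the rows are toy values, and it assumes the `question_id` and `is_really_relevant` columns listed above are present and hold truthy/falsy values.

```python
from collections import defaultdict


def relevant_share_by_question(rows):
    """Map each question_id to the fraction of its retrieved rows marked relevant."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        qid = row["question_id"]
        totals[qid] += 1
        if bool(row["is_really_relevant"]):
            hits[qid] += 1
    return {qid: hits[qid] / totals[qid] for qid in totals}


# Toy rows for illustration; real rows would come from df_error.to_dict("records").
toy_rows = [
    {"question_id": "q1", "is_really_relevant": True},
    {"question_id": "q1", "is_really_relevant": False},
    {"question_id": "q2", "is_really_relevant": False},
]
print(relevant_share_by_question(toy_rows))
```

Questions with a low share are natural candidates for the manual false-positive inspection shown earlier.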

## 4. Dataset alignment via DatasetMapper

If your raw CSVs do not yet match the internal benchmark schema, you can align
them using the same mapping logic that powers the CLI and Streamlit UI.

Each dataset has a YAML config under `report_analyst/config/datasets/`. For
example, `climretrieve.yaml` defines how to map the ClimRetrieve CSVs.

```python
import pandas as pd
from report_analyst.core.benchmark.dataset_mapper import (
    DatasetMapperFactory,
    list_available_dataset_ids,
)

print("Available dataset mapping IDs:", list_available_dataset_ids())

dataset_id = "climretrieve" # or another ID present in config/datasets
mapper = DatasetMapperFactory.get_mapper(dataset_id)

# Raw CSVs from the original benchmark
df_gt_raw = pd.read_csv("climretrieve_ground_truth.csv")
df_bm_raw = pd.read_csv("climretrieve_benchmark.csv")

# Align to the internal schema
df_gt_aligned = mapper.align_ground_truth(df_gt_raw)
df_bm_aligned = mapper.align_benchmark(df_bm_raw)

display(df_gt_aligned.head())
display(df_bm_aligned.head())

# Optionally, save aligned CSVs for later reuse
df_gt_aligned.to_csv("ground_truth_aligned.csv", index=False)
df_bm_aligned.to_csv("benchmark_results_aligned.csv", index=False)
```

You can then feed the aligned CSVs into the evaluation workflow described in
section 2.
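
Conceptually, alignment maps dataset-specific column names onto one shared set of names. The sketch below illustrates only that idea with the standard library. It is not how `DatasetMapperFactory` works internally, and every column name in `COLUMN_MAP` is invented for illustration rather than taken from `climretrieve.yaml` or the real internal schema.

```python
import csv
import io

# Hypothetical mapping: raw column name -> internal column name.
COLUMN_MAP = {
    "Document": "report_name",
    "Question": "question",
    "Relevant Text": "relevant_part_text",
}


def align_rows(rows, column_map):
    """Rename mapped columns and drop unmapped ones; fail loudly on missing sources."""
    aligned = []
    for row in rows:
        missing = [src for src in column_map if src not in row]
        if missing:
            raise KeyError(f"Missing source columns: {missing}")
        aligned.append({dst: row[src] for src, dst in column_map.items()})
    return aligned


# Toy CSV for illustration only.
raw_csv = (
    "Document,Question,Relevant Text,Extra\n"
    "report_a.pdf,What is the target?,Net zero by 2040,ignored\n"
)
rows = list(csv.DictReader(io.StringIO(raw_csv)))
aligned = align_rows(rows, COLUMN_MAP)
print(aligned[0])
```

Raising on a missing source column, instead of silently filling blanks, makes a mismatch between a config and a CSV visible early.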

## 5. Chunk subset analysis (optional diagnostic)

You can inspect how different chunk sizes relate to each other for a given PDF,
using the same `SentenceSplitter` logic as the analyzer but without embeddings.
This is independent of the IR metrics in `EvaluationEngine`, but can help you
understand whether smaller chunks are strict refinements or often fully
contained inside larger chunks.

### From Python

```python
from report_analyst.core.benchmark.chunk_subset_analysis import (
    analyze_pdf_chunk_subsets,
    analyze_multiple_pdfs_chunk_subsets,
)

# Single PDF, subset ratios between sizes 250, 440, 770
result = analyze_pdf_chunk_subsets(
    pdf_path="my_report.pdf",
    chunk_sizes=[250, 440, 770],
    chunk_overlap=20,
)

# result.summary is a pandas DataFrame with:
# chunk_size_small, chunk_size_large, num_small, num_large,
# num_small_subsets, subset_ratio
display(result.summary)

# Multiple PDFs (aggregated statistics across all files)
multi_result = analyze_multiple_pdfs_chunk_subsets(
    pdf_paths=["report1.pdf", "report2.pdf"],
    chunk_sizes=[250, 440, 770],
    chunk_overlap=20,
)
display(multi_result.summary)
```
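
To make the idea of a subset ratio concrete, the sketch below computes the fraction of small chunks whose text lies entirely inside some larger chunk. It is an illustration under simplifying assumptions: it uses a naive fixed-size word splitter instead of `SentenceSplitter`, the helper names are hypothetical, and the library's actual definition of `subset_ratio` may differ.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words, with `overlap` words shared between neighbours."""
    words = text.split()
    step = max(size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def subset_ratio(small_chunks, large_chunks):
    """Fraction of small chunks fully contained in at least one large chunk."""
    if not small_chunks:
        return 0.0
    # Pad with spaces so that matches respect word boundaries.
    padded_large = [f" {chunk} " for chunk in large_chunks]
    contained = sum(
        1 for small in small_chunks
        if any(f" {small} " in large for large in padded_large)
    )
    return contained / len(small_chunks)


# Toy text of 12 distinct words: w0 w1 ... w11.
text = " ".join(f"w{i}" for i in range(12))
small = chunk_words(text, 3)
print(subset_ratio(small, chunk_words(text, 6)))  # sizes nest evenly
print(subset_ratio(small, chunk_words(text, 4)))  # boundaries do not line up
```

When the larger size is a multiple of the smaller one and there is no overlap, every small chunk nests inside a large one. Otherwise chunk boundaries drift apart and the ratio drops.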

## 6. Scripts and local usage

Scripts in `scripts/` (for example `align_benchmark_datasets.py`,
`evaluate_benchmark_from_csv.py`, `find_missing_relevant_parts.py`) are **not**
installed as commands when you `pip install` the library.

To run them directly you should:

1. Clone the repository locally.
2. Create a virtual environment and install dependencies.
3. Run the scripts from the project root, for example:

```bash
python scripts/evaluate_benchmark_from_csv.py \
    --reference path/to/ground_truth_aligned.csv \
    --input path/to/benchmark_results_aligned.csv \
    --k-values 1 3 5 10
```

In Colab, it is usually more convenient to **call the underlying Python
functions directly** (as shown in the examples above) instead of invoking these
scripts via shell commands.