This guide explains how SpatialBench evaluations are structured and how they work. The 7 example evaluations in this repository are fixed to demonstrate the format. To contribute to the full 98-evaluation benchmark, contact kenny@latch.bio.
The 7 evaluations in this repository demonstrate:
- Tasks from real-world analysis workflows
- Reproducible ground truth
- Appropriate grader selection
- Clear task descriptions with specified output format
- Local testing procedures
- Comprehensive documentation
The complete SpatialBench comprises 98 evaluations across four platforms:
| Technology | Evaluations |
|---|---|
| Xenium | 30 |
| Vizgen (MERFISH) | 31 |
| AtlasXomics | 25 |
| Seeker/Curio | 12 |
| Total | 98 |
This repository contains 7 representative examples to demonstrate the evaluation format and grading system. To contribute to the full benchmark, contact kenny@latch.bio.
Good evaluation tasks in SpatialBench:
- Represent real analysis workflows
- Have clear, objective ground truth
- Test specific agent capabilities
- Span diverse difficulty levels
Examples from the benchmark:
- Calculate mean gene count per cell (QC metrics)
- Perform Leiden clustering with specific parameters
- Identify cell populations using marker genes
SpatialBench datasets:
- Are in standard format (.h5ad for most cases)
- Have necessary metadata (cell types, clusters, spatial coords)
- Are properly preprocessed (or evaluations test preprocessing)
- Are hosted in cloud storage with latch:// URIs
Example data node:
{
"data_node": "latch://38438.account/xenium_kidney.h5ad"
}Ground truth is established by running analyses with standard tools:
import scanpy as sc
import numpy as np
adata = sc.read_h5ad("dataset.h5ad")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
ground_truth = {
"n_cells": int(adata.n_obs),
"n_genes": int(adata.n_vars),
"mean_counts": float(adata.X.sum(axis=1).mean()),
"n_pcs": int(adata.obsm['X_pca'].shape[1])
}
print(ground_truth)Document everything:
- Package versions (
scanpy==1.9.1) - Random seeds (if applicable)
- Parameter choices and rationale
- References to papers/protocols
Match your task to the appropriate grader:
| Output Type | Grader | Example |
|---|---|---|
| Numbers | NumericTolerance | {"mean": 45.2} |
| Label set | LabelSetJaccard | {"labels": ["A", "B"]} |
| Percentages | DistributionComparison | {"dist": {"A": 30, "B": 70}} |
| Ranked list | MarkerGenePrecisionRecall | {"genes": ["G1", "G2"]} |
| Expression scores | MarkerGeneSeparation | {"auroc": 0.85, "per_gene": [...]} |
| Spatial metrics | SpatialAdjacency | {"median_dist": 18.5} |
See graders.md for full documentation.
Make it crystal clear:
Bad:
"Cluster the cells and tell me how many clusters there are."
Good:
"Perform Leiden clustering on the preprocessed data using resolution=1.0.
Ensure resulting clusters are largely independent of sample batch
(adjusted mutual information with 'sample' column < 0.3).
Return JSON with fields: num_clusters (int), cluster_labels (list of unique
cluster IDs), ami_with_sample (float between 0 and 1)."
Key elements:
- Specific method (Leiden, not just "clustering")
- Parameters (resolution=1.0)
- Constraints (AMI < 0.3)
- Exact output format with field names and types
Example evaluation file structure:
{
"id": "platform_task_description_v1",
"task": "Your detailed task description here...",
"data_node": "latch://path/to/dataset.h5ad",
"grader": {
"type": "grader_name",
"config": {
"ground_truth": {
...
},
"tolerances": {
...
}
}
}
}Naming convention:
id:xenium_clustering_leiden_v1- File:
evals/clustering/xenium_clustering_leiden.json
Create a test agent:
from spatialbench import EvalRunner
from pathlib import Path
import json
def test_agent(task_prompt, work_dir):
import scanpy as sc
adata_files = list(work_dir.glob("*.h5ad"))
adata = sc.read_h5ad(adata_files[0])
sc.pp.pca(adata, n_comps=50)
answer = {
"n_pcs": adata.obsm['X_pca'].shape[1]
}
answer_file = work_dir / "eval_answer.json"
answer_file.write_text(json.dumps(answer))
return answer
runner = EvalRunner(
"evals/preprocessing/my_new_eval.json",
keep_workspace=True
)
result = runner.run(agent_function=test_agent)
print(f"Passed: {result['passed']}")Test both cases:
- Correct answer → should pass
- Incorrect answer → should fail
def wrong_agent(task_prompt, work_dir):
answer = {"n_pcs": 999}
(work_dir / "eval_answer.json").write_text(json.dumps(answer))
return answer
result = runner.run(agent_function=wrong_agent)
assert not result['passed'], "Grader should fail with wrong answer!"Each evaluation in evals/README.md is documented with:
#### platform_task_description
**Platform**: Xenium/Vizgen/Seeker
**Grader**: grader_name
Description of what this eval tests and why it's important.
**Key Metrics**: field1, field2, field3
**Platform Notes**: Specific considerations for this platform
**Challenges**: What makes this task difficultTo contribute new evaluations to the full 98-evaluation benchmark:
- Review the 7 examples in this repository to understand format and quality standards
- Contact kenny@latch.bio with your proposal
- Provide:
- Task description and rationale
- Ground truth establishment method
- Dataset information
- Grader configuration
- Validation of reproducibility
The full benchmark is maintained separately to prevent overfitting and ensure performance metrics reflect genuine spatial biology reasoning capabilities.
Problem: Agent doesn't know exact output format
{
"task": "Find the major cell types"
}Solution: Specify exact fields
{
"task": "Identify major cell types and return JSON with field: cell_types (list of strings)"
}Problem: Ground truth not reproducible enough
{
"mean_expression": 45.234567,
"tolerance": 0.00001
}Solution: Use reasonable tolerances
{
"mean_expression": 45.2,
"tolerance": 2.0
}Problem: Don't test with wrong answers
Solution: Always test failure cases
Problem: Can't reproduce analysis
Solution: Document all parameters, seeds, versions
Some evals need multiple criteria:
{
"grader": {
"type": "multi_criteria",
"graders": [
{
"type": "numeric_tolerance",
"weight": 0.5,
"config": {...}
},
{
"type": "label_set_jaccard",
"weight": 0.5,
"config": {...}
}
]
}
}Note: This requires implementing a MultiCriteriaGrader - see custom grader docs.
- Start simple: Don't combine too many requirements in one eval
- Be specific: Exact field names, data types, constraints
- Document well: Future you will thank you
- Test thoroughly: Both pass and fail cases
- Consider difficulty: Mix easy, medium, and hard evals
- Real-world relevance: Does this test actual analysis skills?
- Reproducibility: Can someone else verify your ground truth?
The 7 representative evaluations in this repository:
- QC metrics:
evals/qc/seeker_qc_basic.json - Preprocessing:
evals/preprocessing/xenium_normalization.json - Clustering:
evals/clustering/xenium_leiden.json - Cell typing:
evals/cell_typing/xenium_kidney_typing.json - Differential expression:
evals/differential_expression/vizgen_de_temporal.json - Spatial analysis:
evals/spatial_analysis/seeker_spatial_contiguity.json,evals/spatial_analysis/vizgen_tissue_composition.json
- Study the 7 example evaluations for format and structure
- Read specification.md for technical format details
- Read graders.md for available grader types
- See BENCHMARK_SCOPE.md for full benchmark information
- Contact kenny@latch.bio to contribute to the full benchmark