A comprehensive framework for creating structured image search evaluation datasets. This tool automates the entire pipeline from image preprocessing to dataset upload, making it easy to build high-quality benchmarks for image retrieval systems.
- Complete Pipeline Automation: End-to-end workflow from raw images to published datasets
- Flexible Adapter System: Pluggable adapters for vision annotation, relevance judging, and similarity scoring
- Batch Processing: Efficient batch processing for large-scale datasets
- Query Planning: Intelligent query generation with diversity and difficulty balancing
- Comprehensive Analysis: Automatic generation of dataset summaries, statistics, and visualizations
- Hugging Face Integration: Direct upload to Hugging Face Hub with dataset cards and summaries
- Cost Tracking: Monitor API costs throughout the pipeline
- Progress Tracking: Built-in progress bars and logging for long-running operations
```shell
pip install imsearch-benchmaker
```

Install with specific adapters:

```shell
# OpenAI adapters (for vision annotation and relevance judging)
pip install imsearch-benchmaker[openai]

# Local CLIP adapter (for similarity scoring)
pip install imsearch-benchmaker[local]

# All adapters
pip install imsearch-benchmaker[all]
```

Or install from source:

```shell
git clone https://github.com/waggle-sensor/imsearch_benchmaker.git
cd imsearch_benchmaker
pip install -e .
```

Create a config.toml file with your benchmark settings:
```toml
# Basic configuration
benchmark_name = "MyBenchmark"
benchmark_description = "A benchmark for image search"
log_level = "INFO"

# File paths
image_root_dir = "/path/to/images"
images_jsonl = "outputs/images.jsonl"
annotations_jsonl = "outputs/annotations.jsonl"
query_plan_jsonl = "outputs/query_plan.jsonl"
qrels_jsonl = "outputs/qrels.jsonl"
summary_output_dir = "outputs/summary"
hf_dataset_dir = "outputs/hf_dataset"

# Vision adapter configuration
[vision_config]
adapter = "openai"
model = "gpt-4o"

# Judge adapter configuration
[judge_config]
adapter = "openai"
model = "gpt-4o"

# Similarity adapter configuration
[similarity_config]
adapter = "local_clip"
model_name = "openai/clip-vit-base-patch32"
```

Then run the full pipeline:

```shell
# Or set the IMSEARCH_BENCHMAKER_CONFIG_PATH env variable to the path of the config file
benchmaker all --config config.toml
```

This will run the entire pipeline:
- Preprocess: Build `images.jsonl` from your image directory
- Vision: Annotate images with tags, taxonomies, and metadata
- Query Plan: Generate diverse queries with candidate images
- Judge: Evaluate relevance of candidates for each query
- Postprocess: Calculate similarity scores and generate summaries
- Upload: Upload the dataset to Hugging Face Hub
Configuration is done via TOML files (JSON is also supported). The framework uses a BenchmarkConfig class that supports:
- Benchmark metadata: Name, description, author information
- Column mappings: Customize column names for your data structure
- Column names: All fields starting with `column_` or `columns_` define dataset column names
- File paths: Input and output file locations
- Metadata JSONL: Optional path to a metadata JSONL file for merging additional metadata into `images.jsonl`
- Adapter settings: Configure vision, judge, and similarity adapters
- Vision metadata columns: List of columns to extract into `VisionImage.metadata` for prompt interpolation
- Query planning: Control query generation parameters
- Hugging Face: Repository settings for dataset upload
- Logging: Control the logging level
See example/config.toml for a complete configuration example or check the BenchmarkConfig class documentation for more details.
Fields starting with `_` (e.g., `_hf_token`, `_openai_api_key`) are considered sensitive fields.
The rights_map.json file (configured via meta_json in your config) allows you to assign license, DOI, and dataset name metadata to images during preprocessing. This is useful when images come from multiple sources with different licensing requirements and you want to track which original dataset each image came from.
The rights_map.json file has the following structure:
```json
{
  "default": {
    "license": "UNKNOWN",
    "doi": "UNKNOWN",
    "dataset_name": "UNKNOWN"
  },
  "files": {
    "path/to/specific/image.jpg": {
      "license": "CC BY 4.0",
      "doi": "10.1234/example",
      "dataset_name": "CustomDataset"
    }
  },
  "prefixes": [
    {
      "prefix": "sage/",
      "license": "UNKNOWN",
      "doi": "10.1109/ICSENS.2016.7808975",
      "dataset_name": "Sage"
    },
    {
      "prefix": "wildfire/",
      "license": "CC BY 4.0",
      "doi": "10.3390/f14091697",
      "dataset_name": "Wildfire"
    }
  ]
}
```

Metadata is assigned to images using the following priority order (most specific first):
- Exact file match: If an image ID appears in the `files` object, use that metadata
- Longest prefix match: If the image ID starts with any prefix in the `prefixes` array, use the metadata from the longest matching prefix
- Default: Use the metadata from the `default` object
If dataset_name is missing or set to "UNKNOWN" in the rights map, the framework will automatically extract the dataset name from the image ID:
- If the image ID contains a `/` separator (e.g., `sage/imagesampler-bottom-2726/image.jpg`), it extracts the prefix before the first `/` (e.g., `sage`)
- If the image ID has no `/` separator, it uses `"UNKNOWN"` as the dataset name
This allows you to track which original datasets contributed to your benchmark, which is useful for generating dataset proportion visualizations.
For an image with ID `sage/imagesampler-bottom-2726/image.jpg`:
- If `files` contains an exact match, use that metadata (including `dataset_name`)
- Otherwise, if the ID starts with `sage/`, use the `sage/` prefix metadata (including `dataset_name: "Sage"`)
- Otherwise, use the `default` metadata
- If `dataset_name` is still `"UNKNOWN"` or missing, extract `"sage"` from the image ID prefix
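The resolution rules above can be sketched as a small function (a hypothetical re-implementation for illustration; `resolve_rights` is not part of the framework's API):

```python
def resolve_rights(image_id: str, rights_map: dict) -> dict:
    # 1. Exact file match
    meta = rights_map.get("files", {}).get(image_id)
    if meta is None:
        # 2. Longest matching prefix
        matches = [p for p in rights_map.get("prefixes", [])
                   if image_id.startswith(p["prefix"])]
        if matches:
            meta = max(matches, key=lambda p: len(p["prefix"]))
        else:
            # 3. Fall back to the default entry
            meta = rights_map.get("default", {})
    meta = dict(meta)          # copy so the rights map is not mutated
    meta.pop("prefix", None)   # drop the matching key if it came from "prefixes"
    # Fallback: derive dataset_name from the image ID prefix
    if meta.get("dataset_name", "UNKNOWN") == "UNKNOWN":
        meta["dataset_name"] = image_id.split("/")[0] if "/" in image_id else "UNKNOWN"
    return meta
```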
Set the path to your rights map file in config.toml:
```toml
meta_json = "path/to/rights_map.json"
```

Or pass it via the command line:

```shell
benchmaker preprocess --meta-json path/to/rights_map.json
```

If no `meta_json` is provided, you'll be prompted for default license, DOI, and dataset name values during preprocessing.
The dataset name is stored in the original_dataset_name column (configurable via column_original_dataset_name in your config) and is used to generate dataset proportion visualizations in the summary output.
See example/rights_map.json for a complete example.
If you have existing metadata (e.g., human-annotated labels, categories, or other pre-existing annotations) that you want to pass to the vision model to help guide its annotations, you can use the metadata extraction and prompt interpolation feature.
Create a metadata.jsonl file where each row contains an image_id plus any additional metadata columns you want to merge:
```json
{"image_id": "sage/image1.jpg", "existing_label": "wildfire", "category": "outdoor"}
{"image_id": "sage/image2.jpg", "existing_label": "smoke", "category": "outdoor"}
{"image_id": "wildfire/image3.jpg", "existing_label": "flame", "category": "emergency"}
```

Set the path to your metadata JSONL file in config.toml:

```toml
metadata_jsonl = "inputs/metadata.jsonl"
```

During preprocessing, the metadata from this file will be merged into images.jsonl by matching image_id values.
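Conceptually this merge is a left join on `image_id`: images without a matching metadata row pass through unchanged. A minimal sketch of the idea (`merge_metadata` is a hypothetical helper, not the framework's implementation):

```python
import json

def merge_metadata(images_path: str, metadata_path: str, out_path: str) -> None:
    # Index metadata rows by image_id
    meta = {}
    with open(metadata_path) as f:
        for line in f:
            row = json.loads(line)
            meta[row["image_id"]] = {k: v for k, v in row.items() if k != "image_id"}
    # Left-join onto images.jsonl
    with open(images_path) as src, open(out_path, "w") as dst:
        for line in src:
            image = json.loads(line)
            image.update(meta.get(image["image_id"], {}))
            dst.write(json.dumps(image) + "\n")
```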
Specify which columns from images.jsonl should be extracted into VisionImage.metadata for use in vision model prompts:
```toml
[vision_config]
vision_metadata_columns = ["existing_label", "category"]
```

Write template placeholders in your vision model prompts (in config.toml) to include the metadata:
```toml
[vision_config]
system_prompt = """You are labeling images for a retrieval benchmark.
This image has existing label: {metadata.existing_label}
Category: {metadata.category}
Use this information to guide your annotations."""

user_prompt = """Analyze the image and output JSON with:
- summary: <= 30 words, factual, no speculation
- tags: choose 12-18 tags from the provided enum list
- confidence: 0..1 per field"""
```

The framework will automatically interpolate `{metadata.column_name}` placeholders with the actual values from the metadata when building vision requests.
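The interpolation behaves like a simple placeholder substitution. A sketch of the idea (illustrative only, not the library's actual code; here an unknown column becomes an empty string):

```python
import re

def interpolate(prompt: str, metadata: dict) -> str:
    # Replace each {metadata.column_name} placeholder with its value
    return re.sub(
        r"\{metadata\.(\w+)\}",
        lambda m: str(metadata.get(m.group(1), "")),
        prompt,
    )
```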
- Create `metadata.jsonl` with `image_id` and your metadata columns
- Set `metadata_jsonl = "inputs/metadata.jsonl"` in config
- Run `benchmaker preprocess`; metadata will be merged into `images.jsonl`
- Set `vision_metadata_columns = ["existing_label", "category"]` in `[vision_config]`
- Add `{metadata.existing_label}` and `{metadata.category}` placeholders to your prompts
- Run `benchmaker vision`; the vision model will receive the metadata in its prompts
This allows you to leverage existing annotations or metadata to help the vision model produce more accurate or consistent annotations.
```shell
# Set the path to the config file so you don't have to pass it to each command
export IMSEARCH_BENCHMAKER_CONFIG_PATH="path/to/config.toml"

# Run complete pipeline
benchmaker all

# Individual steps
benchmaker preprocess
benchmaker vision
benchmaker plan
benchmaker judge
benchmaker postprocess similarity
benchmaker postprocess summary
benchmaker postprocess add-metadata  # (optional) add vision_metadata_columns to qrels
benchmaker upload
```

```shell
# Check if image URLs are reachable
benchmaker check-urls --images-jsonl outputs/images.jsonl

# Clean intermediate files
benchmaker clean --config config.toml

# List OpenAI batches
benchmaker list-batches --config config.toml
```

For more control over the vision annotation process:
```shell
# Set the path to the config file so you don't have to pass it to each command
export IMSEARCH_BENCHMAKER_CONFIG_PATH="path/to/config.toml"

# Create batch input
benchmaker vision-make

# Submit batch
benchmaker vision-submit

# Wait for completion
benchmaker vision-wait

# Download results
benchmaker vision-download

# Parse results
benchmaker vision-parse

# Retry failed requests
benchmaker vision-retry
```

Similar granular commands are available for the judge step:
```shell
export IMSEARCH_BENCHMAKER_CONFIG_PATH="path/to/config.toml"

benchmaker judge-make
benchmaker judge-submit
benchmaker judge-wait
benchmaker judge-download
benchmaker judge-parse
benchmaker judge-retry
```

The framework uses an adapter pattern for extensibility. Adapters are automatically discovered and registered. You can use different adapters for different tasks simultaneously; for example, OpenAI for vision annotation and Google Gemini for relevance judging. Simply configure each adapter in your config.toml file.
- OpenAI (vision adapter): Uses the OpenAI API for image annotation with structured outputs
  - Tags, taxonomies, boolean fields
  - Confidence scores
  - Controlled vocabularies
- OpenAI (judge adapter): Uses the OpenAI API to evaluate query-image relevance
  - Binary and graded relevance labels
  - Confidence scores
- Local CLIP (similarity adapter): Local CLIP models for similarity scoring
  - Supports any CLIP model from Hugging Face
  - No API costs
The framework supports creating custom adapters for vision annotation, relevance judging, and similarity scoring. Adapters are discovered when placed in the `imsearch_benchmaker/adapters/` directory and registered in the `imsearch_benchmaker/adapters/__init__.py` file. For detailed instructions, code examples, and best practices, see Creating Custom Adapters.
You can use different adapters for different tasks in the same pipeline. For example:
```toml
# Use OpenAI for vision annotation
[vision_config]
adapter = "openai"
model = "gpt-4o"

# Use Google Gemini for relevance judging
[judge_config]
adapter = "gemini"
model = "gemini-pro"

# Use local CLIP for similarity scoring
[similarity_config]
adapter = "local_clip"
model_name = "openai/clip-vit-base-patch32"
```

Each adapter is configured independently, allowing you to choose the best service for each task based on cost, performance, or feature requirements.
```
┌─────────────┐
│   Images    │
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐
│ Preprocess  │────▶│ images.jsonl │
└──────┬──────┘     └──────────────┘
       │
       ▼
┌─────────────┐
│   Vision    │────▶ annotations.jsonl
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Query Plan  │────▶ query_plan.jsonl
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Judge    │────▶ qrels.jsonl
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Postprocess │────▶ qrels_with_score.jsonl
└──────┬──────┘      + summary/
       │
       ▼
┌─────────────┐
│   Upload    │────▶ Hugging Face Hub
└─────────────┘
```
- Preprocess: Converts raw images into JSONL format with metadata (image IDs, licenses, DOIs)
- Vision Annotation: Uses vision adapter to annotate images with summaries, tags, and categorical facets
- Query Planning: Selects seed images and creates candidate pools (positives, neutrals, hard/easy negatives) for each query that will be created by the judge adapter. Use `query_plan_pos_total` for positive candidates (all facets match) and `query_plan_neutral_total` for neutral candidates (one facet off).
- Judge Generation: Uses the judge adapter to generate queries and assign binary relevance labels to each candidate image in the query's candidate pool.
- Postprocessing: Computes similarity scores for all query-image pairs using the similarity adapter. Also, generates exploratory data analysis visualizations and statistics.
- Hugging Face Upload: Prepares the dataset in Hugging Face format for publication and uploads to the Hugging Face dataset repository.
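For intuition on the similarity step above: CLIP-style adapters typically embed the query text and each image, then score the pair by cosine similarity of the two embedding vectors. Stripped of the model itself, the scoring reduces to (a generic sketch with illustrative vectors, not the adapter's actual code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two (nonzero) embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```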
- `images.jsonl`: Image metadata and URLs
- `seeds.jsonl`: Seed images for query generation
- `annotations.jsonl`: Vision annotations (tags, taxonomies, etc.)
- `query_plan.jsonl`: Generated queries with candidate images
- `qrels.jsonl`: Relevance judgments (query-image pairs)
- `qrels_with_score.jsonl`: QRELs with similarity scores
- `summary/`: Directory containing:
  - Dataset statistics (CSV)
  - Visualizations (PNG):
    - Dataset proportion donut chart (showing percentage breakdown by original dataset)
    - Image proportion donuts (for taxonomy columns)
    - Query relevancy distributions
    - Relevance overview charts
    - Similarity score analysis
    - Confidence analysis
    - Cost summaries
    - Word clouds
- `hf_dataset/`: Hugging Face dataset ready for upload
- Row order per query: For each `query_id`, the first row is always the seed image (the ground-truth match for the query). The metadata in that first row (e.g. `summary`, `tags`, taxonomy fields) is the matching metadata for the query. All following rows for the same `query_id` are candidate images, and their metadata describes each candidate. This matters for evaluation: use the first row of each query as the ground-truth match and the remaining rows as candidates.
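When consuming the published dataset, the row-order guarantee makes it easy to split each query into its seed and candidates. A hypothetical helper (assuming `qrels.jsonl`-style rows with a `query_id` column, as described above):

```python
import json
from collections import defaultdict

def split_queries(qrels_path: str) -> dict:
    """Group qrels rows by query_id; the first row per query is the seed."""
    groups = defaultdict(list)
    with open(qrels_path) as f:
        for line in f:
            row = json.loads(line)
            groups[row["query_id"]].append(row)
    # (seed, candidates) per query, relying on file order within each query
    return {qid: (rows[0], rows[1:]) for qid, rows in groups.items()}
```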
See the example/ directory for:
- Complete configuration file (`config.toml`)
- Sample input files
- Example outputs
- Dataset card template
You can also find more examples in the imsearch_benchmarks repository, which contains benchmarks created with this framework for use in Sage Image Search.
- Python >= 3.11
- Core dependencies (automatically installed): see `imsearch_benchmaker/requirements.txt`
- Optional adapter dependencies: see `imsearch_benchmaker/adapters/{adapter_name}/requirements.txt`
Combine imsearch_benchmaker with imsearch_eval for a complete image search evaluation pipeline: imsearch_benchmaker creates the benchmarks, and imsearch_eval uses them to evaluate the performance of an image search system.
Contributions are welcome! Please feel free to submit a Pull Request.
- Author: Francisco Lozano
- Email: francisco.lozano@northwestern.edu
- Affiliation: Northwestern University
- GitHub: FranciscoLozCoding
For issues, questions, or contributions, please open an issue on GitHub.
If you use this framework in your research, please cite:
```bibtex
@software{imsearch_benchmaker,
  title = {Image Search Benchmark Maker},
  author = {Lozano, Francisco},
  organization = {Northwestern University},
  orcid = {0009-0003-8823-4046},
  year = {2026},
  url = {https://github.com/waggle-sensor/imsearch_benchmaker}
}
```

- Fix bug: when `vision-submit` is run with a file input on the CLI, the batch ID is saved in the same directory as the input file; it should be saved in the same location as when config.toml is used
- Add pytest and create a testing pipeline
- Add support for more adapters