A benchmark tool for evaluating how well multimodal embedding models align text and visual representations of colors.
- 🔄 Async Processing: Efficient parallel fetching of embeddings with automatic batching
- 🔍 OpenAPI Validation: Automatically validates endpoints against `/openapi.json` schema
- 📦 Smart Batching: Auto-discovers batch support and optimal batch sizes (4, 8, 16, 32, 64, 128, 256, 512, 1024)
- 💾 Intelligent Caching: Per-model caching to avoid redundant API calls
- 📊 TSV Results: Persistent results tracking with timestamp, mean/median/std metrics
- 🎨 Interactive CLI: Menu-driven interface with questionary
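The batch-size auto-discovery above can be sketched roughly as follows. This is a standalone illustration, not the library's code: `find_max_batch_size` and the `probe` callback are hypothetical names.

```python
# Hypothetical sketch of batch-size auto-discovery: try the candidate
# sizes listed above and keep the largest one the endpoint accepts.
CANDIDATE_SIZES = [4, 8, 16, 32, 64, 128, 256, 512, 1024]

def find_max_batch_size(probe, candidates=CANDIDATE_SIZES):
    """Return the largest candidate size for which probe(size) succeeds.

    `probe` is any callable that raises an exception when the embedding
    endpoint rejects a batch of that size.
    """
    best = 1  # fall back to unbatched requests
    for size in candidates:
        try:
            probe(size)
            best = size
        except Exception:
            break  # assume larger sizes will also fail
    return best

# Example: a pretend endpoint that accepts batches of up to 64 items.
def fake_probe(size):
    if size > 64:
        raise ValueError("batch too large")

print(find_max_batch_size(fake_probe))  # → 64
```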
- Install dependencies: `uv sync`
- Configure environment: `cp .env.example .env`, then edit `.env` to add your API keys (e.g. `OPENAI_API_KEY`)
- Run the CLI: `uv run color-perception-bench`

The `local-default` model is auto-created on first run (pointing to http://localhost:8080).
To add an OpenAI-compatible model:
- Select Manage Models → Add Model
- Name: `openai-text-3-large` (example)
- Provider type: `openai_compatible`
- Base URL: `https://api.openai.com`
- Endpoints: `/v1/embeddings` (usually for both text and image, or specific ones)
- API Key Env Var: `OPENAI_API_KEY`
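The resulting entry in `models.yaml` might look like the fragment below. The field names are inferred from the `add_model` parameters shown later in this README; the file's actual schema may differ.

```yaml
# Hypothetical models.yaml entry; field names mirror add_model's
# parameters and are illustrative, not the verified on-disk schema.
openai-text-3-large:
  provider_type: openai_compatible
  base_url: https://api.openai.com
  text_endpoint: /v1/embeddings
  image_endpoint: /v1/embeddings
  api_key_env_var: OPENAI_API_KEY
  batch_size: 128
```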
- Select Run Benchmark
- Select models using Space, confirm with Enter.
- Choose whether to force refresh the cache.
- Watch the progress bars.
- Select View Last Results in the CLI.
- Or view the raw file: `cat benchmark_results.tsv`
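Since the results file is plain TSV, it can be post-processed with the standard library alone. The sketch below parses an in-memory sample; the column names (`timestamp`, `model`, `mean`, `median`, `std`) are assumed from the metrics this README describes, so check the real header before relying on them.

```python
# Sketch of reading benchmark results TSV with the standard library.
# Column names are assumptions based on the metrics listed in this README.
import csv
import io

sample = (
    "timestamp\tmodel\tmean\tmedian\tstd\n"
    "2024-01-01T00:00:00\tlocal-default\t0.42\t0.40\t0.05\n"
    "2024-01-01T00:05:00\topenai-text-3-large\t0.31\t0.30\t0.04\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
best = min(rows, key=lambda r: float(r["mean"]))  # lower distance = better
print(best["model"])  # → openai-text-3-large
```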
Edit `.env` to store your secrets:

```
OPENAI_API_KEY=sk-...
TOGETHER_API_KEY=...
```

Models are stored in `models.yaml` (git-tracked). Two provider types are available:
- `local`: for custom APIs or localhost servers with OpenAPI specs.
- `openai_compatible`: for OpenAI, Together AI, Anyscale, Fireworks, Replicate, etc.
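A provider resolves its key at runtime from the environment variable named in its config. The helper below is a hypothetical sketch of that lookup (`resolve_api_key` is not the library's actual function), included to show why the variable name in `.env` must match the `api_key_env_var` in the model config exactly:

```python
# Hypothetical sketch: resolve an API key from the env var name stored in
# the model config. Function name and error wording are illustrative.
import os

def resolve_api_key(api_key_env_var: str) -> str:
    key = os.environ.get(api_key_env_var)
    if not key:
        raise RuntimeError(
            f"API key environment variable not set: {api_key_env_var}"
        )
    return key

os.environ["OPENAI_API_KEY"] = "sk-test"  # normally loaded from .env
print(resolve_api_key("OPENAI_API_KEY"))  # → sk-test
```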
src/color_perception_bench/
├── providers/
│ ├── base.py # AsyncEmbeddingProvider protocol
│ ├── local.py # Local API provider
│ └── openai_compatible.py # OpenAI-style API provider
├── benchmark.py # Async benchmark runner
├── cache.py # Per-model caching layer
├── cli.py # Interactive menu interface
├── registry.py # Model configuration management
├── colors.py # XKCD color data (949 colors)
└── experiment.py # Original POC (legacy)
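`providers/base.py` defines the `AsyncEmbeddingProvider` protocol that both provider implementations satisfy. The sketch below shows what such a protocol could look like; the method names and signatures (`embed_text`, `embed_image`) are assumptions for illustration, not the actual contents of `base.py`:

```python
# Hypothetical sketch of an async embedding-provider protocol; method
# names and signatures are illustrative, not those of providers/base.py.
from typing import Protocol, runtime_checkable

@runtime_checkable
class AsyncEmbeddingProvider(Protocol):
    async def embed_text(self, texts: list[str]) -> list[list[float]]: ...
    async def embed_image(self, images: list[bytes]) -> list[list[float]]: ...

class DummyProvider:
    """Minimal structurally-conforming implementation for testing."""
    async def embed_text(self, texts):
        return [[0.0] for _ in texts]
    async def embed_image(self, images):
        return [[0.0] for _ in images]

# runtime_checkable protocols check method presence structurally.
print(isinstance(DummyProvider(), AsyncEmbeddingProvider))  # → True
```

Structural typing keeps the benchmark runner decoupled from any one provider: anything with the right async methods can be benchmarked.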
The benchmark computes cross-modal alignment between text and image embeddings for the same color:
| Metric | Description | Interpretation |
|---|---|---|
| Mean | Average cosine distance | Overall alignment quality |
| Median | Middle value | Robust to outliers |
| Std | Standard deviation | Consistency of alignment |
| Min | Best alignment | Best case performance |
| Max | Worst alignment | Worst case performance |
Lower distance = better alignment between text and image embeddings.
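The summary statistics above can be reproduced from per-color cosine distances with the standard library alone. This is a self-contained sketch with toy 2-D embeddings, not the library's implementation:

```python
# Standalone sketch: cosine distance between a text embedding and an
# image embedding of the same color, then summary stats over all colors.
import math
import statistics

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

# Toy (text_embedding, image_embedding) pairs for three colors.
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([0.0, 1.0], [0.2, 0.8]),
    ([1.0, 1.0], [1.0, 0.9]),
]
distances = [cosine_distance(t, i) for t, i in pairs]

print(f"mean={statistics.mean(distances):.4f}")
print(f"median={statistics.median(distances):.4f}")
print(f"std={statistics.stdev(distances):.4f}")
print(f"min={min(distances):.4f}  max={max(distances):.4f}")
```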
You can also use the library programmatically:
```python
import asyncio
from color_perception_bench import (
    run_benchmark,
    add_model,
    list_models,
    print_results_table,
)

# 1. Add a model programmatically
add_model(
    name="openai-text-3-large",
    provider_type="openai_compatible",
    base_url="https://api.openai.com",
    text_endpoint="/v1/embeddings",
    image_endpoint="/v1/embeddings",
    api_key_env_var="OPENAI_API_KEY",
    batch_size=128,
)

# 2. Run benchmark
asyncio.run(run_benchmark(["local-default", "openai-text-3-large"]))

# 3. Print results
print_results_table()
```

Troubleshooting:
- "Import could not be resolved": Run `uv sync` and ensure `.venv` is activated.
- "API key environment variable not set": Check `.env` and ensure the variable name matches the config.
- "Failed to fetch OpenAPI schema": Ensure the provider is running and the URL is correct (must serve `/openapi.json`).
- Cache not being used: Check the `cache/` directory. Use `force_refresh=True` to rebuild.