This document provides comprehensive information about PromptDiff, including advanced usage, configuration options, and detailed explanations.
- Run prompts against two model configs
- Normalize outputs
- Compare using:
- Text diff
- Embedding similarity
- (Optional) LLM judge
- Generate a report
- Model upgrades
- Prompt tuning
- Regression testing
- AI QA pipelines
PromptDiff supports two comparison modes:
Compare two models (baseline vs candidate) in a one-to-one comparison.
Use when:
- Testing a new model version against a baseline
- Quick iteration between two specific models
- CI/CD regression testing
- Focused, detailed analysis
Example:
promptdiff run \
--prompts prompts.json \
--baseline ollama:llama3 \
--candidate ollama:granite4Output: One comparison report showing detailed differences between the two models.
Compare multiple models pairwise (each model compared with every other model).
Use when:
- Evaluating 3+ model options simultaneously
- Choosing between multiple models
- Comprehensive model benchmarking
- Research and analysis
Example:
promptdiff compare \
--prompts prompts.json \
--models ollama:llama3,ollama:granite4,ollama:qwen2.5 \
--names llama3,granite4,qwen2.5 \
--output-dir resultsOutput: Combined report with all pairwise comparisons:
- llama3 vs granite4
- llama3 vs qwen2.5
- granite4 vs qwen2.5
Performance Note: Multi-model comparison performs N×(N-1)/2 comparisons:
- 3 models = 3 comparisons
- 4 models = 6 comparisons
- 5 models = 10 comparisons
PromptDiff supports multiple model providers:
# OpenAI (default)
--baseline gpt-4
--baseline openai:gpt-4
# Anthropic
--candidate anthropic:claude-3-opus
--candidate anthropic:claude-3-sonnet
# Ollama (local models)
--baseline ollama:llama3
--candidate ollama:granite4
--candidate ollama:qwen2.5
# Local (stub implementation)
--baseline local:my-modelPromptDiff supports loading configuration from a .env file for convenience.
-
Copy the example file:
cp .env.example .env
-
Edit
.envand add your API keys:OPENAI_API_KEY=sk-your-key-here ANTHROPIC_API_KEY=sk-ant-your-key-here OLLAMA_BASE_URL=http://localhost:11434
-
The
.envfile is automatically loaded when you run PromptDiff.
You can also set environment variables directly:
Windows (PowerShell):
$env:OPENAI_API_KEY="sk-..."
$env:ANTHROPIC_API_KEY="sk-ant-..."
$env:OLLAMA_BASE_URL="http://localhost:11434"Linux/Mac:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OLLAMA_BASE_URL="http://localhost:11434"Note: Environment variables take precedence over .env file values.
Before using Ollama models, make sure:
-
Ollama is installed and running:
# Install Ollama from https://ollama.ai ollama serve -
Pull the models you want to compare:
ollama pull llama3 ollama pull granite4 ollama pull qwen2.5
-
Run comparisons:
# Compare two Ollama models promptdiff run \ --prompts examples/prompts.json \ --baseline ollama:llama3 \ --candidate ollama:granite4 # Or use the provided comparison script python examples/compare_ollama_models.py
promptdiff run \
--prompts prompts.json \
--baseline gpt-4 \
--candidate gpt-4.1 \
--output results.json \
--embedding-model sentence-transformers/all-MiniLM-L6-v2 \
--temperature 0.7 \
--max-tokens 1000PromptDiff includes a modern web interface for interactive comparisons:
# Launch the UI
promptdiff ui
# Or specify host and port
promptdiff ui --host 0.0.0.0 --port 8501The UI provides:
- 📝 Interactive prompt loading - Upload JSON files, type/paste JSON, use examples, or build manually
- 🤖 Model configuration - Always-visible model selection in sidebar
- 🔍 Single comparison mode - Compare two models with detailed side-by-side diffs
- 🔬 Multi-model comparison mode - Compare multiple models pairwise with combined reports
- 📊 Visual results - Charts, metrics, and similarity scores
- 📈 Summary statistics - Aggregate metrics across all prompts
- 💾 Download reports - Export markdown reports and JSON results
- ✅ Selective examples - Choose specific example prompts instead of loading all
Then open your browser to http://localhost:8501 for the interactive interface!