Automate prompts, seeds, metrics, and reproducible run logging.
Built for AI researchers, labs, and developers to evaluate image and video diffusion models faster and compare results consistently.
⭐ Star the repo for updates ⭐
Product Vision: AI Research
DreamLayer AI is an open-source benchmarking and evaluation platform for image generation models and video generation models. It automates prompts, seeds, metrics, configs, and reproducible run logging so researchers and developers can compare model quality faster and more consistently. It runs locally with a React frontend, Flask-based services, SQLite run storage, and ComfyUI integration for image workflows.
Compare model outputs across prompts, seeds, configs, and metrics with reproducible run logging.
DreamLayer AI is built for:
- AI researchers comparing diffusion models across prompts, seeds, and metrics
- ML Engineers evaluating image and video generation quality
- Labs and teams building internal benchmarking workflows for generative models
- Open-source model creators testing checkpoints, LoRAs, and workflows
- Developers integrating custom metrics and evaluation pipelines
DreamLayer can benchmark:
- Image generation model outputs
- Video generation model outputs
- Prompt-to-image alignment
- Image quality and aesthetic quality
- Object-level prompt adherence
- Temporal video consistency
- Reference-based image and video similarity metrics
Status: ✨ Now live
Easiest way to run DreamLayer 😃
- Download this repo
- Open the folder in Cursor (an AI-native code editor)
- Type `run it` or press the "Run" button, then follow the guided steps
Cursor will:
- Walk you through each setup step
- Install Python and Node dependencies
- Create a virtual environment
- Start the backend and frontend
- Output a localhost:8080 link you can open in your browser
⏱️ Takes about 5-10 minutes. No terminal needed. Just click, run, and you’re in. 🚀
On macOS, PyTorch setup may take a few retries. Just keep pressing Run when prompted. Cursor will guide you through it.
Linux:

```bash
./install_linux_dependencies.sh
```

macOS:

```bash
./install_mac_dependencies.sh
```

Windows (PowerShell):

```powershell
# If needed, allow script execution for this session:
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\install_windows_dependencies.ps1
```

To start DreamLayer, Linux:

```bash
./start_dream_layer.sh
```

macOS:

```bash
./start_dream_layer.sh
```

Windows:

```bat
start_dream_layer.bat
```

Environment variables:

- `DLVENV_PATH` (used by `install_dependencies_linux`): preferred path to the Python virtual environment; default is `/tmp/dlvenv`
- `DREAMLAYER_COMFYUI_CPU_MODE` (used by `start_dream_layer`): if no NVIDIA drivers are available, run using CPU only; default is `false`
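As a sketch of how these variables behave, the following hypothetical Python snippet reads them with the documented defaults (the real launch scripts are shell and batch files; only the variable names and defaults come from the documentation above):

```python
import os

# Hypothetical sketch: resolve the documented launch-time settings.
# Defaults come from the README; the real scripts are shell/batch.
DLVENV_PATH = os.environ.get("DLVENV_PATH", "/tmp/dlvenv")
CPU_MODE = os.environ.get("DREAMLAYER_COMFYUI_CPU_MODE", "false").lower() == "true"

print(DLVENV_PATH, CPU_MODE)
```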
- Frontend: http://localhost:8080
- ComfyUI: http://localhost:8188
DreamLayer ships without weights to keep the download small. You have two ways to add models:
DreamLayer can also call external APIs (OpenAI DALL·E, Flux, Ideogram).
To enable them:
Edit your .env file in the repository root (./.env):
```
OPENAI_API_KEY=sk-...
BFL_API_KEY=flux-...
IDEOGRAM_API_KEY=id-...
STABILITY_API_KEY=sk-...
```

Once a key is present, the model becomes visible in the dropdown. No key = feature stays hidden.
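The "no key = hidden" behavior can be sketched as a simple environment check. This is an illustrative sketch, not the actual backend logic, and the model names in the mapping are assumptions:

```python
import os

# Hypothetical key-to-model mapping; the real gating lives in the backend.
API_MODELS = {
    "dall-e-3": "OPENAI_API_KEY",
    "flux-pro": "BFL_API_KEY",
    "ideogram-v3": "IDEOGRAM_API_KEY",
    "sd-turbo": "STABILITY_API_KEY",
}

def visible_models(env=os.environ):
    """Return only the API models whose key is present in the environment."""
    return [name for name, key in API_MODELS.items() if env.get(key)]

print(visible_models({"OPENAI_API_KEY": "sk-test"}))  # → ['dall-e-3']
```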
Step 1: Download .safetensors or .ckpt files from:
- Hugging Face
- Civitai
- Your own training runs
Step 2: Place the models in the appropriate folders (auto-created on first run):
- `Checkpoints/` → full checkpoints (`.safetensors`)
- `Lora/` → LoRA & LoCon files
- `ControlNet/` → ControlNet models
- `VAE/` → optional VAEs
Step 3: Click Settings ▸ Refresh Model List in the UI — the models appear in dropdowns.
Tip: Use symbolic links if your checkpoints live on another drive.
The installation scripts will automatically install all dependencies and set up the environment.
For FID scoring, download the CIFAR-10 reference dataset:
```bash
python scripts/fetch_datasets.py
```

Note: The YOLO model (`yolov8n.pt`, ~6MB) for object detection metrics auto-downloads on first use.
| 🔍 Feature | 🚀 How it's better |
|---|---|
| Automated Benchmarking | One run sweeps N prompts × M seeds × K samplers. Metrics compute live during generation, so a manual benchmark that would take 1 to 2 weeks finishes in 3 to 5 hours per model. |
| Reproducibility by Default | Every run persists to SQLite with prompt, negative prompt, seed, sampler, steps, CFG, model hash, LoRA stack, ControlNet config, and all computed metrics. Replay any run by run_id. |
| Image and Video Metrics, Built In | Image: CLIPScore (ViT-L/14), FID, LAION aesthetic, color harmony, sharpness, YOLOv8 composition F1. Video: FVD (I3D), SSIM, PSNR, LPIPS, temporal flickering, subject and background consistency (DINO), motion smoothness. Custom metrics pluggable. |
| Multi-Modal Today | Image and video evaluation are available out of the box. Audio benchmarking is on the roadmap. See the Metrics section below for the exact call graph and storage schema. |
| Reference-Free and Reference-Based | Works without a ground-truth image or video for CLIPScore, aesthetics, YOLO composition, temporal flickering, and DINO consistency. Add a reference video to unlock SSIM, PSNR, LPIPS. FID operates on a reference set. |
| Cached, Incremental, Comparable | Metrics persist per run in a dedicated SQLite table and return instantly on re-fetch. Batch backfill endpoints recompute missing metrics across the full history. Compare any two runs side by side via the comparison API. |
| Researcher-Friendly Exports | Run locally on your own GPU (CUDA, MPS, or CPU fallback). Export to CSV per run or a ZIP report bundle with images, metadata, and metrics for leaderboard submission or paper appendices. |
DreamLayer supports a working set of common image and video evaluation metrics, including CLIPScore, FID, aesthetic scoring, LPIPS, SSIM, PSNR, composition precision/recall/F1, temporal flickering, subject consistency, background consistency, and motion smoothness. These metrics run either automatically during generation or on demand per run, are exposed through live HTTP routes, and persist to SQLite for reproducible benchmarking and comparison.
- CLIPScore: prompt-to-image alignment using cosine similarity between CLIP text and image embeddings. Higher is better (0 to 1). No reference needed. Backbone: CLIP ViT-L/14.
- FID (Fréchet Inception Distance): distribution distance between generated images and a reference image set. CIFAR-10 ships as the default reference. Lower is better. Reference required. Backbone: Inception-V3.
- LAION Aesthetic Score: learned aesthetic quality prediction from CLIP embeddings. Higher is better (0 to 10). No reference needed. Backbone: LAION linear predictor on CLIP ViT-L/14.
- Color Harmony, Saturation Balance, Value Contrast: HSV-space color theory analysis using k-means clustering. Higher is better (0 to 1). No reference needed. Backbone: OpenCV.
- Technical Quality: sharpness, noise level, and artifact detection per image. Higher is better (0 to 1). No reference needed. Backbone: Laplacian variance plus heuristics.
- Composition Precision, Recall, F1: object-level prompt adherence, comparing detected objects against a prompt-derived object list. Higher is better (0 to 1). No reference needed. Backbone: YOLOv8n.
- FVD (Fréchet Video Distance): distribution distance between two sets of videos in I3D feature space. Lower is better. Reference required.
- Video SSIM: per-frame structural similarity, reported as mean and standard deviation across frames. Higher is better (0 to 1). Reference required.
- Video PSNR: per-frame peak signal-to-noise ratio, reported as mean and standard deviation. Higher is better (dB). Reference required.
- Video LPIPS: per-frame learned perceptual similarity between generated and reference frames. Lower is better. Reference required. Backbone: LPIPS with AlexNet.
- Temporal Flickering: frame-to-frame stability using mean absolute error between consecutive frames. Higher is better (0 to 1). No reference needed.
- Subject Consistency: how stable the main subject’s appearance is across frames. Higher is better (0 to 1). No reference needed. Backbone: DINO feature similarity.
- Background Consistency: how stable the background is across frames. Higher is better (0 to 1). No reference needed. Backbone: DINO feature similarity.
- Motion Smoothness: smoothness of optical flow between consecutive frames. Higher is better (0 to 1). No reference needed. Backbone: OpenCV optical flow.
- Per-Frame Aesthetic: LAION aesthetic score applied to each frame, reported as a mean. Higher is better (0 to 10). No reference needed. Backbone: LAION predictor on CLIP ViT-L/14.
Temporal Flickering, Subject Consistency, Background Consistency, and Motion Smoothness are adapted from VBench (CVPR 2024).
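CLIPScore, as defined above, reduces to a cosine similarity between normalized embeddings. The sketch below shows just that computation with NumPy; the random stand-in vectors are placeholders for real CLIP ViT-L/14 text and image embeddings:

```python
import numpy as np

def clip_score(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalized text and image embeddings.

    Sketch of the CLIPScore definition above; real scores come from
    CLIP ViT-L/14 embeddings, which these vectors merely stand in for.
    """
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return float(np.clip(t @ i, 0.0, 1.0))  # clamp into the reported 0-to-1 range

emb = np.ones(768)           # 768 dims, matching ViT-L/14's embedding width
print(clip_score(emb, emb))  # → 1.0 for identical embeddings
```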
- Live during image generation: CLIPScore, LAION aesthetic, color metrics, technical quality, and YOLO composition. Results are written to the metrics table as soon as the image is saved.
- On demand for images: FID. Requires the CIFAR-10 reference stats (run `python scripts/fetch_datasets.py` once), then a POST /api/runs/calculate-metrics call, or the batch backfill script for historical runs.
- On demand for video: all video metrics. Trigger per video with POST /api/calculate-video-metrics, or batch across all unscored videos with POST /api/calculate-all-video-metrics. Results are cached to the video_metrics table and return instantly on re-fetch.
Metrics persist across three dedicated SQLite tables:
- metrics: image scalar metrics and aesthetic sub-scores
- composition_metrics: YOLO precision, recall, F1, detected objects, missing objects
- video_metrics: FVD, SSIM, PSNR, LPIPS, plus a JSON blob of per-frame arrays and VBench-style quality metrics
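The compute-once, return-from-cache behavior described above can be sketched with the standard-library `sqlite3` module. The table and column names here are simplified illustrations, not the actual DreamLayer schema:

```python
import sqlite3

# Simplified sketch of the cache-then-return pattern described above.
# Table and column names are illustrative, not the real schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE metrics (run_id TEXT, name TEXT, value REAL, "
    "PRIMARY KEY (run_id, name))"
)

def get_metric(run_id, name, compute):
    """Return a cached metric, computing and persisting it only on a miss."""
    row = db.execute(
        "SELECT value FROM metrics WHERE run_id=? AND name=?", (run_id, name)
    ).fetchone()
    if row:
        return row[0]
    value = compute()
    db.execute("INSERT INTO metrics VALUES (?, ?, ?)", (run_id, name, value))
    return value

first = get_metric("run-42", "clip_score", lambda: 0.31)    # computed and stored
second = get_metric("run-42", "clip_score", lambda: 999.0)  # served from cache
print(first, second)  # → 0.31 0.31
```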
You can export any run or slice of runs to CSV through the report bundle endpoint, or download a ZIP containing images, prompts, configs, and every computed metric for leaderboard submissions or paper appendices.
- Python 3.8+
- Node.js 16+
- 8GB+ RAM recommended
- Star this repository.
- Share the screenshot on X ⁄ Twitter with #DreamLayerAI to spread the word.
All contributions (code, docs, art, tutorials) are welcome!
- Create a PR and follow the evidence requirements in the template.
- See CHANGELOG Guidelines for detailed contribution process.
DreamLayer AI will ship under the GPL-3.0 license when the code is released.
All trademarks and closed-source models referenced belong to their respective owners.
DreamLayer AI includes a comprehensive test suite covering all functionality, including ClipScore integration, database operations, and API endpoints.
```bash
# Install test dependencies
pip install -r tests/requirements.txt

# Run all tests
python tests/run_all_tests.py

# Run specific test categories
python tests/run_all_tests.py unit        # Unit tests only
python tests/run_all_tests.py integration # Integration tests only
python tests/run_all_tests.py api         # API endpoint tests
python tests/run_all_tests.py clipscore   # ClipScore functionality tests

# Run with verbose output
python tests/run_all_tests.py all -v
```

| Test File | Coverage | Description |
|---|---|---|
| `test_txt2img_server.py` | Text-to-Image API | Tests txt2img generation and database integration |
| `test_img2img_server.py` | Image-to-Image API | Tests img2img generation and database integration |
| `test_run_registry.py` | Run Registry API | Tests database-first API with ClipScore retrieval |
| `test_report_bundle.py` | Report Generation | Tests Mac-compatible report bundle creation |
| `test_clip_score.py` | ClipScore Integration | Tests CLIP model calculation and database storage |
| `test_database_integration.py` | Database Operations | Tests 3-table schema and database operations |
- ✅ Unit Tests - Individual component testing
- ✅ Integration Tests - End-to-end workflow testing
- ✅ API Tests - HTTP endpoint testing with Flask test client
- ✅ Database Tests - SQLite operations with temporary test databases
- ✅ Mock Testing - External dependency mocking (ComfyUI, CLIP model)
- ✅ Error Handling - Edge cases and error condition testing
- ✅ Mac Compatibility - ZIP file generation testing
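The mock-testing approach listed above can be sketched with `unittest.mock`: replace an expensive external dependency (here, a stand-in for the CLIP scorer) so the surrounding logic runs without loading the real model. The function names below are hypothetical, not the suite's actual API:

```python
from unittest.mock import MagicMock

# Hypothetical wrapper standing in for code that calls a CLIP scorer.
def score_run(compute_clip_score, prompt, image_path):
    """Toy wrapper that rounds whatever the scorer returns."""
    return round(compute_clip_score(prompt, image_path), 3)

# Mock the expensive dependency instead of loading the real CLIP model.
mock_scorer = MagicMock(return_value=0.3141)
result = score_run(mock_scorer, "a red cube", "out.png")
print(result)  # → 0.314
mock_scorer.assert_called_once_with("a red cube", "out.png")
```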
```bash
# Run specific test file
python -m pytest tests/test_clip_score.py -v

# Run specific test method
python -m pytest tests/test_clip_score.py::TestClipScore::test_clip_score_calculation_with_mock -v

# Run with coverage report
python -m pytest tests/ --cov=dream_layer_backend --cov-report=html
```

The test suite requires these additional dependencies:

- `pytest` - Test framework
- `pytest-cov` - Coverage reporting
- `pytest-mock` - Mocking utilities
- `requests-mock` - HTTP request mocking

Install with: `pip install -r tests/requirements.txt`
Yes. All five are fully implemented and persisted to SQLite. CLIPScore computes live during image generation. FID runs on demand against a reference image set. Video SSIM, Video PSNR, and Video LPIPS run on demand against a reference video. Batch backfill endpoints recompute missing metrics across the full run history.
ComfyUI is a node-based generation interface. DreamLayer is a benchmarking workbench built on top of ComfyUI for image workflows, paired with dedicated Flask services for run logging, metric computation, comparison APIs, and CSV or ZIP exports. ComfyUI handles "make this image." DreamLayer handles "benchmark these models across these prompts and seeds, log everything, and let me compare results."
Automatic1111, InvokeAI, and Forge are excellent generation UIs. DreamLayer is a capable generation UI as well, but it adds benchmarking infrastructure on top: persistent SQLite logging with full prompt, seed, sampler, and config metadata; built-in image and video quality metrics; side-by-side run comparison; batch metric backfills; and CSV or ZIP exports for leaderboard submission. None of those generation UIs ship with end-to-end evaluation tooling.
VBench, EvalCrafter, HEIM, and similar evaluation frameworks are standardized benchmark suites: they define fixed prompts, tasks, and scoring methods so you can report comparable benchmark results. DreamLayer is benchmarking infrastructure: you bring your own prompts, models, and configs, then run generation, scoring, run logging, and comparison workflows in one place. The two are complementary. DreamLayer’s evaluation stack also draws on HELM-style benchmarking concepts and includes video quality metrics inspired by VBench, such as temporal flickering, subject consistency, background consistency, and motion smoothness.
Can I benchmark Stable Diffusion, Flux, DALL·E, Gemini, Runway, Luma, Ideogram, and Stability AI models with DreamLayer?
Yes. DreamLayer can benchmark both local open-source models and supported API-based models. For local workflows, that includes models like Stable Diffusion 1.5, SDXL, Flux, and custom checkpoints. For API-based workflows, DreamLayer supports models shown in the UI such as Luma Labs Photon, Black Forest Labs Flux Pro, OpenAI DALL·E 3, Google Gemini Nano Banana, Runway Gen 4, Ideogram V3, and Stability AI SD Turbo. Add local model files to the Checkpoints/, Lora/, ControlNet/, and VAE/ folders, or add API keys to .env, and supported models appear in the UI for benchmarking.
Yes for Luma AI, Runway ML, and Google's Veo3. DreamLayer integrates with their video APIs out of the box via the txt2vid_server — just add the API key to .env. Sora support depends on OpenAI exposing a public video generation API. For local open-source video models that run through ComfyUI, drop the checkpoint into the appropriate folder and refresh the model list.
Yes, this is a core use case. Every run persists to SQLite with the full prompt, negative prompt, seed, sampler, steps, CFG, model hash, LoRA stack, ControlNet config, and all computed metrics. You can replay any run by run_id, sweep across multiple seeds or samplers in one batch, and compare any two runs side by side via the comparison API.
DreamLayer computes CLIPScore as the cosine similarity between CLIP text and image embeddings using the openai/clip-vit-large-patch14 backbone. The score lands in the 0 to 1 range, where higher values indicate stronger prompt-to-image alignment. No reference image is needed. CLIPScore computes live during image generation and writes directly to the metrics table, surfaced via the run registry API and included in CSV exports.
DreamLayer calculates FID using torchmetrics.image.fid.FrechetInceptionDistance with Inception-V3 features at 2048 dimensions. The default reference set is CIFAR-10, which you fetch once with python scripts/fetch_datasets.py. Lower FID indicates a closer distributional match to the reference. FID is on-demand: trigger per run via POST /api/runs/calculate-metrics, or batch-backfill across historical runs.
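To make the formula behind FID concrete, the sketch below computes the Fréchet distance between two Gaussians, simplified to diagonal covariances so the matrix square root becomes elementwise. This illustrates the math only; the real metric applies the full-covariance formula to Inception-V3 feature statistics via torchmetrics:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance ||mu1-mu2||^2 + Tr(S1+S2-2(S1*S2)^0.5) for
    diagonal covariances (a simplification for illustration; real FID
    uses full covariance matrices of Inception-V3 features)."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

# Identical feature distributions give a distance of 0.
print(frechet_distance_diag([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))  # → 0.0
```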
Yes. The metric pipeline is modular. Each metric is implemented as a standalone calculator in dream_layer_backend_utils/, registered with the database layer, and surfaced through the existing metrics, composition_metrics, or video_metrics tables. Add your computation in the same pattern as the existing calculators and register it with the database queries module to flow through the registry, comparison API, CSV export, and ZIP report bundle.
Yes. Drop .safetensors files into the auto-created Lora/, ControlNet/, and VAE/ folders, then refresh the model list in Settings. The full stack of active LoRAs (with weights), ControlNet config, and VAE choice persists with every run, so you can replay an exact LoRA and ControlNet combination by run_id or compare results across LoRA variants in a single batch.
Yes. A single benchmark run sweeps N prompts across M seeds across K samplers, and you can vary CFG, steps, and resolution per cell. Every cell becomes a row in the runs table with its own run_id and metrics. The comparison API lets you slice the resulting matrix any way you need: by sampler, by CFG value, by seed, or any combination.
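The N × M × K sweep described above is essentially a Cartesian product over parameters. A minimal sketch, where the prompts, seeds, and samplers are placeholders and each cell would become one row in the runs table:

```python
from itertools import product

# Placeholder sweep axes; real runs would use your own prompts and configs.
prompts = ["a red cube", "a glass fox"]
seeds = [0, 1, 2]
samplers = ["euler", "dpmpp_2m"]

# One dict per cell of the benchmark grid: 2 prompts x 3 seeds x 2 samplers.
cells = [
    {"prompt": p, "seed": s, "sampler": k, "cfg": 7.0, "steps": 30}
    for p, s, k in product(prompts, seeds, samplers)
]
print(len(cells))  # → 12 runs
```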
Yes, on both Intel and Apple Silicon Macs. The install script ./install_mac_dependencies.sh handles PyTorch and dependency setup on either architecture. On Apple Silicon (M1, M2, M3), DreamLayer uses the MPS (Metal Performance Shaders) backend automatically for GPU-accelerated metric computation. On Intel Macs or when MPS is unavailable, DreamLayer falls back to CPU, which works for every metric but runs slower.
A run is one image or video generation event tied to a unique run_id. DreamLayer logs the prompt, negative prompt, seed, sampler, steps, CFG, model hash, LoRA stack, ControlNet config, VAE, batch size, generation type (txt2img, img2img, txt2vid, img2vid), the workflow JSON, the output filename, and every metric computed for that output. Runs persist to SQLite indefinitely and can be replayed, exported, or compared at any time.
Every run is assigned a run_id that links to its full configuration in SQLite: prompt, negative prompt, seed, sampler, steps, CFG, model hash, LoRA stack, and ControlNet config. Replay by run_id from the run registry to regenerate the exact image with the exact metrics, or fork a run by changing one parameter (such as the sampler or seed) for a controlled comparison.
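The replay-and-fork pattern described above can be sketched as copying a run's stored config and overriding one field. The registry lookup is simulated with an in-memory dict, and the config keys mirror the fields the README says are logged:

```python
# Stand-in for the SQLite run registry; keys mirror the logged fields.
registry = {
    "run-7f3a": {
        "prompt": "a lighthouse at dusk", "negative_prompt": "", "seed": 1234,
        "sampler": "euler", "steps": 30, "cfg": 7.0, "model_hash": "abc123",
    }
}

def fork(run_id, **overrides):
    """Copy a run's config, changing only the named parameters."""
    config = dict(registry[run_id])  # shallow copy; original stays intact
    config.update(overrides)
    return config

# Controlled comparison: same seed and prompt, different sampler.
variant = fork("run-7f3a", sampler="dpmpp_2m")
print(variant["sampler"], variant["seed"])  # → dpmpp_2m 1234
```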
No. DreamLayer runs locally on your machine, and prompts, generated images, configs, and metrics stay in your local filesystem and SQLite database by default. The only exception is when you choose to use an API-based model such as DALL·E, Flux, Ideogram, Stability AI, Runway, Luma, or Gemini, in which case the relevant request data is sent to that provider for generation. DreamLayer does not perform telemetry, analytics, or background uploads on its own.
Yes. Every Flask service exposes HTTP endpoints (txt2img, img2img, video metrics, run registry, report bundle) that you can call from a CI job. A typical pattern: trigger a fixed prompt set against a candidate model, fetch CLIPScore and aesthetic metrics from the run registry, compare against a baseline run_id from the previous release, and fail the build if any metric regresses beyond a defined threshold.
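The regression gate in that CI pattern boils down to a threshold comparison against a baseline. A minimal sketch, where the metric names and values are illustrative and "higher is better" is assumed for every metric listed:

```python
def check_regressions(baseline, candidate, tolerance=0.02):
    """Return the metrics that regressed by more than `tolerance`.

    Assumes every metric is higher-is-better; a missing candidate
    metric counts as a regression.
    """
    return [
        name for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    ]

# Illustrative values fetched from a previous release's run and the candidate.
baseline = {"clip_score": 0.31, "aesthetic": 6.1}
candidate = {"clip_score": 0.30, "aesthetic": 5.2}
failed = check_regressions(baseline, candidate)
print(failed)  # → ['aesthetic']
```

A CI job would fail the build when `failed` is non-empty.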
Benchmark runtime depends on the model, hardware, batch size, and selected metrics. In one representative image benchmark, DreamLayer processed 200 prompts in 45 minutes per model on an Intel MacBook Pro across API-based models including Photon, Flux Pro, DALL·E 3, Nano Banana, Runway Gen 4, Ideogram V3, and Stability SD Turbo. Using the same prompts, seeds, and configs across runs, DreamLayer handled generation, scoring, and output aggregation automatically. Larger batches and heavier metrics increase total runtime, but DreamLayer still makes reproducible benchmarking much faster than running the workflow manually.
We’re grateful to our earliest supporters who starred the repo and supported us from the start 🚀
