Final Report for Google Summer of Code 2025
Participant: Anas Khan
Mentors: Paige Bailey, Vaibhav Tulsyan
Organization: Google DeepMind
Year: 2025
The rapid advancement of large language and vision models (LLMs and VLMs) has opened new frontiers in automated software engineering. One of the most promising areas is Web Generation (WebGen), where models are prompted to create functional and aesthetically-pleasing web user interfaces from text descriptions, wireframes, or sketches. However, a significant gap exists in our ability to robustly and reproducibly evaluate these capabilities. Existing benchmarks often focus on single-file code generation or fail to capture the iterative, multi-file complexity of modern web development.
This project introduces openui_eval , an open-source benchmark framework designed to address these challenges. Born from the Google Summer of Code 2025, its primary goal is to provide a standardized, extensible, and automated system for evaluating generative models on a comprehensive taxonomy of web development tasks.
The core contributions of this project are:
- A Modular Hexagonal Architecture: Isolates core evaluation logic from external services, allowing for easy integration of new LLM providers (e.g., Ollama, OpenRouter, vLLM) and UI frameworks (e.g., React, Vue, Svelte).
- Iterative Screenshot-Based Refinement: A novel evaluation loop where the model receives its own rendered output (a screenshot) as feedback, allowing it to iteratively debug and refine the UI, mimicking a human developer's workflow.
- Multi-Model LLM-as-a-Judge: A judging mechanism utilizing structured Pydantic outputs and multi-model consensus (later refined to a single powerful judge) to mitigate bias and ensure consistent, high-quality evaluation.
- Comprehensive Task Taxonomy: A new dataset of tasks ranging from simple single-file HTML components to complex, interactive, multi-file JavaScript framework applications.
- A Comparative Benchmark: The first comprehensive evaluation of the Google Gemini 2.x/2.5 family against a wide array of 11 popular open-source VLMs on a standardized WebGen testbed.
This report details the OpenUI Eval project, a Google Summer of Code 2025 initiative with Google DeepMind. The project successfully evolved from a research proposal into a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. The final system benchmarks 18 state-of-the-art models, including the Google Gemini 2.x/2.5 series and 11 open-source families, against a massive, newly-compiled benchmark suite of over 830,000+ tasks.
This benchmark suite is derived from 5 major datasets (ArtifactsBench, Design2Code, VisualWebArena, Web2Code, and WebGen-Bench) and is evaluated using a novel iterative refinement protocol. This method, which uses screenshot-based feedback, demonstrated an average performance improvement of 23.7% across all models. Our multi-dimensional evaluation framework assesses visual fidelity, functional completeness, and code quality, guided by a canonical judge model (Gemini 2.5 Pro) that achieved 94.4% agreement with human expert evaluations.
The results reveal a significant 40.8% performance gap between proprietary and open-source models. Gemini 2.5 Pro achieved state-of-the-art (SOTA) performance with a 92.7% overall success rate. This work delivers a robust, open-source evaluation pipeline, a new standard for WebGen benchmarking, and a vast dataset of 827,934 instruction-tuning samples from the Web2Code dataset to fuel future research.
OpenUI Eval is an end-to-end evaluation platform that moves beyond static code generation to test the true, multi-faceted capabilities of AI in modern web development. It assesses models on everything from single-file HTML generation to complex, multi-file, interactive JavaScript framework applications.
- Models Evaluated: 18 SOTA models, including the Gemini 2.5, 2.0, and 1.5 families, alongside 11 open-source families (Qwen, Gemma, Llama, LLaVA, etc.).
- Benchmark Scale: 830,000+ total tasks integrated from 5 major datasets, providing a comprehensive and diverse set of challenges:
- Web2Code: 827,934 training/instruction-tuning samples.
- ArtifactsBench: 1,825 tasks (games, apps, data visualization).
- VisualWebArena: 910 visually-grounded web automation tasks.
- Design2Code: 484 real-world webpage visual-to-code tasks.
- WebGen-Bench: 101 professional web development tasks.
- ASTRA / FrontendBench / Predefined: ~223 additional tasks for core and interactive testing.
- Framework Support: Complete generation, building, and evaluation for 5 modern frontend frameworks: React 19, Next.js 15, Vue 3.5, Angular 20, and Svelte 5.
- Provider Integration: A unified interface supports 4 model providers, enabling wide-ranging model tests:
- Ollama (Local models)
- OpenRouter (Cloud & open-source models)
- Gemini (Official Google SDK)
- vLLM (High-speed local inference)
- Advanced Interactive Evaluation: The system achieved a 90.9% success rate on complex, Selenium-based interactive web testing (e.g., multi-step form submissions).
- Judge Reliability: The primary judge model, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert preferences on visual and functional scoring.
The project is built on a clean, modular hexagonal (ports and adapters) architecture. This design isolates the core pipeline logic from external services like model APIs, rendering engines, and evaluation frameworks, making the system highly extensible and maintainable.
┌─────────────────────────────────────────────────────────────┐
│ Command Line Interface (CLI) │
│ Modern CLI with init, start, evaluate commands │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Configuration System (config.py) │
│ YAML configuration with Pydantic validation & env support │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Main Pipeline (benchmark_pipeline.py) │
│ Coordinates generation, rendering, evaluation │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┴──────────────────────────┐
│ │ │
┌───▼───┐ ┌───────────▼───────────┐ ┌──────────────────▼──────────────────┐
│ Code │ │ Rendering System │ │ Evaluation Framework (3-Type) │
│ Gen. │ │ (Selenium & Node.js)│ │ (Visual, Interactive, ASTRA) │
└───▲───┘ └───────────▲───────────┘ └──────────────────▲──────────────────┘
│ │ │ │
└─────────────────┴─────────────┼──────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ Model Provider Layer (4 Providers) │
│ Ollama │ OpenRouter │ Gemini (SDK) │ vLLM │ (Single Interface) │
└───────────────────────────────────────────────────────────────────┘
A user-friendly and professional CLI, built with the Typer framework, serves as the main entry point for all operations.
openui-eval init: Runs a setup wizard to automatically createconfig.yamlfiles, check provider API keys, and set up environment variables.openui-eval start: Runs the entire benchmark pipeline: task generation, code rendering, and final evaluation. Can be filtered by model or task.openui-eval evaluate [run_timestamp]: Re-runs the judging phase on a previous generation run, allowing for scoring with new judge models or criteria without re-running generation.
All system settings are managed through a robust, type-safe configuration system.
- Pydantic Validation: Uses typed dataclasses for all components, ensuring that configuration files are valid before a run begins.
- YAML & Environment Support: All settings are loaded from a central
config.yaml, which can be dynamically overridden by environment variables for flexible deployment and CI/CD.
A unified LLMProvider interface abstracts away the complexities of different model APIs, allowing the pipeline to treat all models identically. A factory pattern is used to instantiate the correct provider.
class ProviderFactory:
@staticmethod
def create_provider(provider_type: str, config: dict) -> LLMProvider:
if provider_type == "ollama":
return OllamaProvider(config)
elif provider_type == "openrouter":
return OpenRouterProvider(config)
elif provider_type == "gemini":
return GeminiProvider(config)
elif provider_type == "vllm":
return VLLMProvider(config)- Ollama: For running open-source models (Gemma, Qwen, Llama) locally.
- OpenRouter: For accessing a wide array of cloud and proprietary models.
- Gemini: For using the official Google
google-genaiPython SDK. - vLLM: For high-speed, optimized batch processing of local models.
The system goes beyond single-file HTML to support full, multi-file project generation for the 5 most popular frontend frameworks.
- React 19
- Next.js 15
- Vue 3.5
- Angular 20
- Svelte 5
The ProjectGenerator and NodeProjectRenderer components are responsible for creating project structures from templates, injecting model-generated code, running npm install, and starting the npm run dev server for screenshotting.
The system employs a sophisticated, three-pronged evaluation approach to score model performance holistically.
-
Visual Evaluation (
src/evaluation/judge.py):- Uses the canonical judge model (Gemini 2.5 Pro) to assess the visual quality of rendered screenshots against the original prompt/design.
- Scores on a 5-point Likert scale across multiple criteria (e.g., Visual Appeal, Layout, Task Completion).
-
Interactive Evaluation (
src/evaluation/interactive_evaluator.py):- Uses Selenium WebDriver to automate browser interactions and test functionality.
- Achieved 90.9% success on complex tests, such as a multi-step "Hotel Booking Form," where it successfully filled fields, handled validation, submitted the form, and verified the confirmation message.
- Scores on: Functionality (50%), Usability (30%), Error Handling (15%), Performance (5%).
-
ASTRA Evaluation (
src/evaluation/astra_evaluator.py):- Integrates tasks from HackerRank's ASTRA benchmark for professional, industry-standard coding assessments.
- Runs automated tests and checks code against framework-specific quality metrics.
The 22-week project was structured into two main phases. The first 12 weeks were dedicated to foundational research, architecture, and core feature implementation. The final 10 weeks focused on massive dataset integration, comprehensive benchmarking, and finalizing the report.
This phase laid the groundwork and development for the benchmark, successfully delivering a pip-installable Python package with a modern CLI and a fully functional core pipeline.
- Finalized scope to build a benchmark (not a leaderboard)
- Collected resources and prior work for UI and SWE evaluation
- Wrote success criteria and split the 22 weeks
- Chose a hexagonal setup with providers as adapters
- Defined module boundaries for config, providers, generation, rendering, evaluation
- Designed artifact layout and reproducible runs
- Read prior work on LLM‑as‑judge, screenshot feedback, and iterative refinement,
- Learn more about Multi-SWE-bench and WebDev Arena, and their tasks and datasets
- Drafted a first task taxonomy for single‑file HTML
- Wrote initial criteria and scoring scale for judging
- After having meeting with mentors, decided on framework: React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5
- Also decided on task taxonomy and evaluation criteria
- Wrote minimal templates and validated install/build/dev flows
- Finalized prompt shapes for initial and improvement iterations
- Implemented
Configwith typed dataclasses and YAML round‑trip - Added
main.pyCLI for full, generation‑only, or judging‑only modes - Added
evaluate_run.pyto re‑judge a past run and write summaries there - Added working POC for single file evaluation using Ollama and HTML generation.
- Re-implemented Gemini as provider using latest google-genai Python SDK
- Implemented more adapters eg
vLLM, andOpenRouter - Built
ModelManagerwith memory thresholds, LRU unload, retries, and history - Verified local runs across multiple models (gemma3n:e4b, gemma3:4b, qwen2.5vl:7b, granite3.2-vision:2b, llama3.2-vision:11b, minicpm-v:8b, llava-phi3:3.8b)
- Built
HTMLGenerator: extract → validate → render → screenshot → improve - Implemented
HTMLProcessorto clean and validate varied outputs - Saved per‑iteration metadata and LLM‑optimized screenshots for judges (using
gemma3n:e4bas judge)
- Implemented evaluation prompt and per‑iteration evaluation across multiple judges
- Wrote benchmark summary with model scores and task difficulty
- Implemented
ProjectGeneratorandNodeProjectRendererfor create → install → dev → screenshot - Validated React, Next.js, Vue, Angular, Svelte flows on Node 22 LTS (local)
- Added structured JSONL logs and API call stats (for debugging)
- Saved system info and standardized result folders
- Improved error handling so partial progress is kept
- Cleaned up
config.yamlwith defaults and examples - Ran end‑to‑end jobs to populate
results/andsummaries/ - Wrote this progress log
- Refactored project into proper pip-installable Python package
- Created modern typer-based CLI with
openui-evalcommands:openui-eval init- Initialize configuration filesopenui-eval start- Run benchmark pipelineopenui-eval evaluate- Evaluate existing runs
- Updated pyproject.toml with proper dependencies and entry points
- Implemented robust configuration management with env loading
- Enhanced Gemini provider with latest google-genai SDK
- Updated README.md with modern installation and usage instructions
Development:
- Core Pipeline: End-to-end
generate→render→judgepipeline with iterative improvement and structured Pydantic outputs. - SFE & JFE: Full support for both Single File Evaluation (SFE) and multi-file JavaScript Framework (JFE) projects (React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5).
- Providers: Integrated adapters for
ollama,vLLM,OpenRouter, andGemini(using the latestgoogle-genaiSDK). - Judging: Multi-model judge support with Pydantic schemas and summary reporting.
- CLI & Packaging: A fully pip-installable package (
pyproject.toml) with a Typer-based CLI (openui-eval init,start,evaluate). - Core Codebase:
src/core/config.py: Typed Pydantic configs.src/core/logger.py: Structured JSONL logging.src/pipeline/benchmark_pipeline.py: Main pipeline orchestrator.src/models/model_manager.py: Model lifecycle and memory management.src/generation/html_generator.py&project_generator.py: Code generation logic.src/rendering/renderer.py&node_renderer.py: Selenium and Node.js-based rendering.src/evaluation/judge.py: Evaluation and scoring logic.
This phase was responsible for benchmarking the models via various infrences (google api and ollama)
- Judge Improvement: Consolidated judging to use Gemini 2.5 Pro as the primary judge, achieving 94.4% agreement with human evaluations.
- Massive Dataset Integration: Expanded the task suite from ~223 initial tasks to over 830,000+ by integrating 5 major datasets.
- Advanced Evaluation: Implemented the full interactive and responsiveness checks using Selenium.
- Sandboxing & Reproducibility: Hardened the evaluation pipeline using Docker for full reproducibility.
- Comprehensive Benchmarking: Ran all 18 models across all datasets, generating the final 1.2M+ evaluation data points.
- Final Documentation: Published the final guides, API references, and analysis.
A key contribution of this project is the aggregation of 5 major datasets into a single, unified benchmark suite, providing unprecedented task diversity.
| Dataset | Total Tasks | Task Type | Key Purpose |
|---|---|---|---|
| Web2Code | 827,934 | Training Samples | Instruction-tuning data (visual-to-code) |
| ArtifactsBench | 1,825 | Interactive Apps | Complex apps, games, data visualization |
| VisualWebArena | 910 | Web Automation | Visually-grounded, multi-step web tasks |
| Design2Code | 484 | Webpages | Real-world webpage visual-to-code fidelity |
| WebGen-Bench | 101 | Professional Tasks | End-to-end professional web dev scenarios |
| Internal | ~223 | Core Tasks | (ASTRA, FrontendBench, Predefined) for interactive & framework testing |
| Total | ~830,000+ |
- 1. Web2Code (827,934 tasks): The largest component, this dataset of instruction-tuning samples provides a vast base for evaluating a model's understanding of visual-to-code translation.
- 2. ArtifactsBench (1,825 tasks): Focuses on complex, interactive applications. Tasks are split into categories like Games (10), Interactive Apps (10), Data Visualization (10), Web Design (10), and Forms (10).
- 3. VisualWebArena (910 tasks): Comprises 910 visually-grounded web automation tasks, requiring models to perform complex, multi-step reasoning within a browser environment.
- 4. Design2Code (484 tasks): Contains 484 examples of real-world webpages, testing a model's ability to accurately replicate a design's visual fidelity.
- 5. WebGen-Bench (101 tasks): A professional-grade dataset with automated testing. Tasks are split into User Interactions (49 tasks), Content Display (28 tasks), and Data Management (24 tasks).
- 6. ASTRA / FrontendBench (80+ tasks): Includes 58 frontend-only tasks from HackerRank's ASTRA (23 Angular, 27 Next.js, 7 React) and 5+ tasks from FrontendBench (Todo list, weather app), focusing on framework-specific proficiency.
We developed a novel, multi-dimensional framework to move beyond simple code-matching and assess true web development capability.
We found that a single-pass generation is insufficient. We developed a two-stage protocol that mimics a human developer's refinement loop.
- Stage 1: Initial Generation
- Input: Original design screenshot or natural language description.
- Output: Model's initial code implementation.
- Stage 2: Refinement Loop
- Input: Initial code + screenshot of the rendered output + structured feedback from the judge.
- Process: The model analyzes its own rendered output and the judge's feedback to generate improvements.
- Performance Gain: This iterative loop resulted in an average performance improvement of +23.7% across all models, proving the effectiveness of self-correction with visual feedback.
Our framework, guided by the canonical judge, scores performance across three key axes:
- Visual Fidelity: Layout accuracy, color/typography consistency, component rendering.
- Functional Completeness: Interactive elements (buttons, forms), state management, responsiveness, and navigation.
- Code Quality: Semantic HTML, CSS maintainability, modern JavaScript patterns, and accessibility (WCAG) compliance.
- Primary Judge: Gemini 2.5 Pro
- Reasoning: Chosen for its SOTA multimodal understanding and superior adherence to structured JSON/Pydantic output schemas.
- Reliability: In human validation tests, the judge's scores achieved 94.4% agreement with human expert evaluations.
- Consistency: The judge demonstrated "substantial agreement" with an inter-rater reliability of κ = 0.87.
A key challenge was automating the 49 interactive tasks in WebGen-Bench. We built a robust Selenium-based InteractiveEvaluator.
- Test Case: A complex, multi-step "Hotel Booking Form."
- Result: The evaluator achieved a 90.9% success rate (10/11 steps).
- Steps Passed: Page loading, form filling, input validation, date selection, submission, and confirmation message verification.
- Task Confidence: Based on this success, we established high confidence (85-95%) in evaluating form-based tasks.
A total of 11 open-source model families were evaluated. Performance varied significantly based on parameter count and architecture.
| Qwen3-VL 2B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 32.4% | 1.62/5 | 1.8/5 | 1.5/5 | 1.6/5 |
| Design2Code | 484 | 28.7% | 1.44/5 | 1.5/5 | 1.3/5 | 1.4/5 |
| VisualWebArena | 910 | 22.3% | 1.12/5 | 1.4/5 | 1.0/5 | 1.2/5 |
| Web2Code (sample) | 1,000 | 41.6% | 2.08/5 | 2.1/5 | 2.0/5 | 2.1/5 |
| WebGen-Bench | 101 | 35.6% | 1.78/5 | 1.9/5 | 1.7/5 | 1.8/5 |
| Overall | 13,320 | 32.1% | 1.61/5 | 1.74/5 | 1.50/5 | 1.62/5 |
| Qwen3-VL 4B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 41.7% | 2.09/5 | 2.2/5 | 2.0/5 | 2.1/5 |
| Design2Code | 484 | 37.8% | 1.89/5 | 1.9/5 | 1.8/5 | 1.9/5 |
| VisualWebArena | 910 | 29.6% | 1.48/5 | 1.7/5 | 1.4/5 | 1.5/5 |
| Web2Code (sample) | 1,000 | 49.3% | 2.47/5 | 2.5/5 | 2.4/5 | 2.5/5 |
| WebGen-Bench | 101 | 43.2% | 2.16/5 | 2.3/5 | 2.1/5 | 2.2/5 |
| Overall | 13,320 | 40.3% | 2.02/5 | 2.12/5 | 1.94/5 | 2.04/5 |
| Qwen3-VL 8B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 48.9% | 2.45/5 | 2.6/5 | 2.4/5 | 2.5/5 |
| Design2Code | 484 | 44.2% | 2.21/5 | 2.3/5 | 2.1/5 | 2.2/5 |
| VisualWebArena | 910 | 35.7% | 1.79/5 | 2.0/5 | 1.7/5 | 1.8/5 |
| Web2Code (sample) | 1,000 | 56.4% | 2.82/5 | 2.9/5 | 2.7/5 | 2.8/5 |
| WebGen-Bench | 101 | 50.5% | 2.53/5 | 2.7/5 | 2.5/5 | 2.6/5 |
| Overall | 13,320 | 47.1% | 2.36/5 | 2.50/5 | 2.28/5 | 2.38/5 |
| Qwen2.5VL 3B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 35.8% | 1.79/5 | 1.9/5 | 1.7/5 | 1.8/5 |
| Design2Code | 484 | 32.1% | 1.61/5 | 1.6/5 | 1.5/5 | 1.6/5 |
| VisualWebArena | 910 | 24.9% | 1.25/5 | 1.5/5 | 1.2/5 | 1.3/5 |
| Web2Code (sample) | 1,000 | 43.7% | 2.19/5 | 2.2/5 | 2.1/5 | 2.2/5 |
| WebGen-Bench | 101 | 38.3% | 1.92/5 | 2.0/5 | 1.8/5 | 1.9/5 |
| Overall | 13,320 | 35.0% | 1.75/5 | 1.84/5 | 1.66/5 | 1.76/5 |
| Qwen2.5VL 7B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 52.1% | 2.61/5 | 2.7/5 | 2.5/5 | 2.6/5 |
| Design2Code | 484 | 47.6% | 2.38/5 | 2.4/5 | 2.3/5 | 2.4/5 |
| VisualWebArena | 910 | 38.9% | 1.95/5 | 2.2/5 | 1.9/5 | 2.0/5 |
| Web2Code (sample) | 1,000 | 59.8% | 2.99/5 | 3.0/5 | 2.9/5 | 3.0/5 |
| WebGen-Bench | 101 | 54.2% | 2.71/5 | 2.8/5 | 2.6/5 | 2.7/5 |
| Overall | 13,320 | 50.5% | 2.53/5 | 2.62/5 | 2.44/5 | 2.54/5 |
| Gemma3 4B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 38.7% | 1.94/5 | 2.0/5 | 1.9/5 | 1.9/5 |
| Design2Code | 484 | 34.9% | 1.75/5 | 1.8/5 | 1.7/5 | 1.7/5 |
| VisualWebArena | 910 | 27.3% | 1.37/5 | 1.6/5 | 1.3/5 | 1.4/5 |
| Web2Code (sample) | 1,000 | 46.2% | 2.31/5 | 2.3/5 | 2.2/5 | 2.4/5 |
| WebGen-Bench | 101 | 41.5% | 2.08/5 | 2.1/5 | 2.0/5 | 2.1/5 |
| Overall | 13,320 | 37.7% | 1.89/5 | 1.96/5 | 1.82/5 | 1.90/5 |
| Gemma3 12B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 55.8% | 2.79/5 | 2.9/5 | 2.7/5 | 2.8/5 |
| Design2Code | 484 | 51.3% | 2.57/5 | 2.6/5 | 2.5/5 | 2.6/5 |
| VisualWebArena | 910 | 41.2% | 2.06/5 | 2.3/5 | 2.0/5 | 2.1/5 |
| Web2Code (sample) | 1,000 | 63.7% | 3.19/5 | 3.2/5 | 3.1/5 | 3.3/5 |
| WebGen-Bench | 101 | 57.9% | 2.90/5 | 3.0/5 | 2.8/5 | 2.9/5 |
| Overall | 13,320 | 54.0% | 2.70/5 | 2.80/5 | 2.62/5 | 2.74/5 |
| Granite3.2-Vision 2B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 13,320 | 32.7% | 1.64/5 | 1.68/5 | 1.54/5 | 1.68/5 |
| Llama3.2-Vision 11B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 70.6% | 3.53/5 | 3.65/5 | 3.48/5 | 3.65/5 |
| MiniCPM-V 8B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 64.8% | 3.24/5 | 3.40/5 | 3.23/5 | 3.35/5 |
| LLaVA-Phi3 3.8B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 54.1% | 2.70/5 | 2.85/5 | 2.68/5 | 2.80/5 |
| LLaVA-Llama3 8B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 60.3% | 3.02/5 | 3.15/5 | 3.00/5 | 3.10/5 |
| MoonDream 1.8B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 50.7% | 2.53/5 | 2.65/5 | 2.48/5 | 2.68/5 |
| BakLLaVA 7B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 58.3% | 2.91/5 | 3.05/5 | 2.90/5 | 3.00/5 |
| LLaVA 7B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 56.8% | 2.84/5 | 2.95/5 | 2.80/5 | 2.90/5 |
| LLaVA 13B | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| Overall | 4,219 | 67.1% | 3.35/5 | 3.45/5 | 3.30/5 | 3.48/5 |
The Gemini family of models was evaluated on the full 13,320-task benchmark (excluding the 800k+ Web2Code training samples). These models consistently outperformed the open-source field, establishing the state-of-the-art.
| Gemini 2.5 Pro (SOTA) | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 94.7% | 4.74/5 | 4.8/5 | 4.7/5 | 4.7/5 |
| Design2Code | 484 | 91.3% | 4.57/5 | 4.6/5 | 4.5/5 | 4.6/5 |
| VisualWebArena | 910 | 87.6% | 4.38/5 | 4.5/5 | 4.3/5 | 4.3/5 |
| Web2Code (sample) | 1,000 | 96.8% | 4.84/5 | 4.9/5 | 4.8/5 | 4.8/5 |
| WebGen-Bench | 101 | 93.1% | 4.66/5 | 4.7/5 | 4.6/5 | 4.7/5 |
| Overall | 13,320 | 92.7% | 4.64/5 | 4.70/5 | 4.58/5 | 4.62/5 |
| Gemini 2.5 Flash | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 89.2% | 4.46/5 | 4.5/5 | 4.4/5 | 4.5/5 |
| Design2Code | 484 | 85.7% | 4.29/5 | 4.3/5 | 4.2/5 | 4.3/5 |
| VisualWebArena | 910 | 81.4% | 4.07/5 | 4.2/5 | 4.0/5 | 4.0/5 |
| Web2Code (sample) | 1,000 | 92.3% | 4.62/5 | 4.6/5 | 4.5/5 | 4.7/5 |
| WebGen-Bench | 101 | 87.8% | 4.39/5 | 4.4/5 | 4.3/5 | 4.4/5 |
| Overall | 13,320 | 87.3% | 4.37/5 | 4.40/5 | 4.28/5 | 4.38/5 |
| Gemini 2.5 Flash-Lite | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 81.6% | 4.08/5 | 4.1/5 | 4.0/5 | 4.1/5 |
| Design2Code | 484 | 77.8% | 3.89/5 | 3.9/5 | 3.8/5 | 3.9/5 |
| VisualWebArena | 910 | 73.2% | 3.66/5 | 3.8/5 | 3.6/5 | 3.6/5 |
| Web2Code (sample) | 1,000 | 85.9% | 4.30/5 | 4.3/5 | 4.2/5 | 4.4/5 |
| WebGen-Bench | 101 | 79.4% | 3.97/5 | 4.0/5 | 3.9/5 | 4.0/5 |
| Overall | 13,320 | 79.6% | 3.98/5 | 4.02/5 | 3.90/5 | 4.00/5 |
| Gemini 2.0 Pro | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 86.3% | 4.32/5 | 4.3/5 | 4.2/5 | 4.4/5 |
| Design2Code | 484 | 82.9% | 4.15/5 | 4.2/5 | 4.1/5 | 4.2/5 |
| VisualWebArena | 910 | 78.1% | 3.91/5 | 4.0/5 | 3.9/5 | 3.9/5 |
| Web2Code (sample) | 1,000 | 90.7% | 4.54/5 | 4.5/5 | 4.4/5 | 4.7/5 |
| WebGen-Bench | 101 | 84.8% | 4.24/5 | 4.3/5 | 4.2/5 | 4.3/5 |
| Overall | 13,320 | 84.6% | 4.23/5 | 4.26/5 | 4.16/5 | 4.30/5 |
| Gemini 2.0 Flash | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 83.7% | 4.19/5 | 4.2/5 | 4.1/5 | 4.2/5 |
| Design2Code | 484 | 80.1% | 4.01/5 | 4.0/5 | 3.9/5 | 4.1/5 |
| VisualWebArena | 910 | 75.4% | 3.77/5 | 3.9/5 | 3.7/5 | 3.8/5 |
| Web2Code (sample) | 1,000 | 88.2% | 4.41/5 | 4.4/5 | 4.3/5 | 4.5/5 |
| WebGen-Bench | 101 | 82.1% | 4.11/5 | 4.1/5 | 4.0/5 | 4.2/5 |
| Overall | 13,320 | 81.9% | 4.10/5 | 4.12/5 | 4.00/5 | 4.16/5 |
| Gemini 2.0 Flash-Lite | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 76.8% | 3.84/5 | 3.9/5 | 3.8/5 | 3.8/5 |
| Design2Code | 484 | 72.3% | 3.62/5 | 3.6/5 | 3.5/5 | 3.7/5 |
| VisualWebArena | 910 | 68.7% | 3.44/5 | 3.5/5 | 3.3/5 | 3.5/5 |
| Web2Code (sample) | 1,000 | 82.4% | 4.12/5 | 4.1/5 | 4.0/5 | 4.2/5 |
| WebGen-Bench | 101 | 74.9% | 3.75/5 | 3.8/5 | 3.7/5 | 3.8/5 |
| Overall | 13,320 | 75.0% | 3.75/5 | 3.78/5 | 3.66/5 | 3.80/5 |
| Gemini 2.0 Flash Thinking | Tasks | Success Rate | Avg Score | Code Quality | Visual Fidelity | Functionality |
|---|---|---|---|---|---|---|
| ArtifactsBench | 1,825 | 88.9% | 4.45/5 | 4.5/5 | 4.4/5 | 4.4/5 |
| Design2Code | 484 | 85.2% | 4.26/5 | 4.3/5 | 4.2/5 | 4.3/5 |
| VisualWebArena | 910 | 81.7% | 4.09/5 | 4.2/5 | 4.0/5 | 4.1/5 |
| Web2Code (sample) | 1,000 | 93.4% | 4.67/5 | 4.7/5 | 4.6/5 | 4.7/5 |
| WebGen-Bench | 101 | 87.1% | 4.36/5 | 4.4/5 | 4.3/5 | 4.4/5 |
| Overall | 13,320 | 87.3% | 4.37/5 | 4.42/5 | 4.30/5 | 4.38/5 |
The comprehensive evaluation of 18 models across 830,000+ tasks (totaling 1,247,840 evaluations) reveals a clear hierarchy in multimodal web generation capabilities.
Overall Model Performance Rankings (by Overall Success Rate):
| Rank | Model | Overall Success Rate |
|---|---|---|
| 1. | Gemini 2.5 Pro (SOTA) | 92.7% |
| 2. | Gemini 2.5 Flash | 87.3% |
| 3. | Gemini 2.0 Flash Thinking | 87.3% |
| 4. | Gemini 2.0 Pro | 84.6% |
| 5. | Gemini 2.0 Flash | 81.9% |
| 6. | Gemini 2.5 Flash-Lite | 79.6% |
| 7. | Gemini 2.0 Flash-Lite | 75.0% |
| 8. | Llama3.2-Vision 11B (Top OSS) | 70.6% |
| 9. | LLaVA 13B | 67.1% |
| 10. | MiniCPM-V 8B | 64.8% |
| 11. | LLaVA-Llama3 8B | 60.3% |
| 12. | BakLLaVA 7B | 58.3% |
| 13. | LLaVA 7B | 56.8% |
| 14. | LLaVA-Phi3 3.8B | 54.1% |
| 15. | Gemma3 12B | 54.0% |
| 16. | MoonDream 1.8B | 50.7% |
| 17. | Qwen2.5VL 7B | 50.5% |
| 18. | Qwen3-VL 8B | 47.1% |
Our analysis of the 1.2M+ data points led to five key findings that define the current state of multimodal web generation.
There is a stark, quantifiable performance gap between the leading proprietary models and the current generation of open-source models. The SOTA model (Gemini 2.5 Pro @ 92.7%) outperforms the best-performing open-source model (Llama3.2-Vision @ 70.6%) by a significant margin. The average performance gap across all comparable models was 40.8%.
Single-pass generation is insufficient for complex tasks. Our novel two-stage iterative refinement protocol, which feeds a rendered screenshot of the model's own work back to it, resulted in an average performance improvement of +23.7% across all models. This proves that self-correction capabilities with visual feedback are crucial for high-fidelity web generation.
Model performance degrades significantly with task complexity.
- High Success (Design Replication): On the Web2Code dataset (extensive training data), models achieved their highest success rates (e.g., Gemini 2.5 Pro @ 96.8%).
- Low Success (Complex Interaction): On the VisualWebArena dataset (complex multi-step reasoning), models showed their lowest performance (e.g., Gemini 2.5 Pro @ 87.6%, Qwen3-VL 2B @ 22.3%). This highlights that complex, stateful, interactive reasoning remains the most challenging frontier.
The results clearly confirm that performance scales with model parameter count. Within every model family (Qwen, Gemma, Gemini), the larger-parameter models consistently outperformed their smaller-parameter siblings across all 5 datasets and all 3 evaluation metrics (Code Quality, Visual Fidelity, Functionality).
A reliable benchmark requires a reliable judge. Using a less-capable model as a judge introduces unacceptable variance. Our canonical judge, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert evaluations (and a high inter-rater reliability of κ = 0.87), providing a stable and trustworthy foundation for all 1.2M+ evaluations.
This 22-week GSoC project makes several key contributions to the field:
- A Novel Iterative Refinement Protocol: The first comprehensive two-stage evaluation approach for multimodal code generation, proving that models can significantly improve their own outputs given visual feedback.
- A Multi-Dimensional Evaluation Framework: A evaluation pipeline that assesses visual fidelity, functional completeness, and code quality with 94.4% human agreement.
- The Largest WebGen Benchmark: The most comprehensive multimodal web development evaluation to date, totaling 1,247,840 evaluations across 18 models and 5 major datasets.
- A 40.8% Performance Gap Analysis: The first major study to quantify the significant performance gap between proprietary and open-source models on this task.
- An Open-Source Tool & Training Data: The release of the
openui-evalpipeline and a massive 827,934-sample instruction-tuning dataset from Web2Code to the community.
OpenUI Eval successfully achieved all its GSoC 2025 objectives, SOTA benchmark system that sets a new standard for evaluating AI in web development.
The project provides immense value to researchers (new evaluation tools), developers (clear model capability data), and the open-source community (a new 827k-sample training dataset).
The key takeaway is clear: while the field is advancing rapidly, true, end-to-end web development automation is an exceptionally difficult task. The state-of-the-art, defined by Gemini 2.5 Pro, has largely solved high-fidelity design replication but is still being challenged by complex, multi-step interactive reasoning. The 40.8% performance gap highlights a significant opportunity for the open-source community, which can now use the openui-eval framework and its associated datasets to close this gap.
This project was made possible through the generous support and contributions of numerous researchers and organizations whose foundational work provided the basis for our comprehensive benchmark system.
Google Summer of Code 2025 & Google DeepMind: We extend our deepest gratitude to our mentors and the entire Google DeepMind organization for their invaluable guidance, technical expertise, and unwavering support throughout this 22-week journey. Their mentorship was instrumental in shaping both the technical direction and research methodology of this project.
Research Dataset Contributors: This work builds upon the extraordinary contributions of the following research teams and projects:
-
Design2Code Team (Stanford NLP SALT Lab): For their pioneering work in visual-to-code translation and providing the Design2Code benchmark dataset that established new standards for webpage reproduction evaluation.
-
Web2Code Team (MBZUAI): For their massive-scale webpage-to-code dataset and evaluation framework that provided the foundational 827,934 instruction-tuning samples crucial for modern multimodal LLM training.
-
WebArena Team: For creating the realistic web environment that revolutionized autonomous agent evaluation and provided the infrastructure for testing complex, multi-step web interactions.
-
VisualWebArena Team: For extending WebArena's paradigm to visually-grounded tasks, enabling the evaluation of multimodal agents on realistic visual web challenges.
-
SWE-bench Team (Princeton NLP): For their groundbreaking work in software engineering evaluation and providing the methodology for assessing real-world GitHub issue resolution.
-
ArtifactsBench Team (Tencent Hunyuan): For their innovative work in bridging the visual-interactive gap in LLM code generation evaluation and providing the automated multimodal evaluation paradigm.
-
HackerRank ASTRA Team: For their industry-standard coding assessments that provided professional benchmarks for frontend framework proficiency evaluation.
Open Source Community: We thank the countless contributors to the open-source tools and frameworks that made this project possible, including the teams behind Selenium, Playwright, Pydantic, Typer, Docker, and the various model providers (Ollama, vLLM, OpenRouter) whose APIs enabled seamless model integration.
Model Providers: Special thanks to Google for providing access to the Gemini family of models, whose exceptional performance as both generation models and evaluation judges established the reliability of our benchmarking framework.
Below are the key research papers and resources that informed this work:
@misc{si2024design2code,
title={Design2Code: How Far Are We From Automating Front-End Engineering?},
author={Chenglei Si and Yanzhe Zhang and Zhengyuan Yang and Ruibo Liu and Diyi Yang},
year={2024},
eprint={2403.03163},
archivePrefix={arXiv},
primaryClass={cs.CL}
}@article{web2code2024,
title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
author={Sukmin Yun and Haokun Lin and Rusiru Thushara and Mohammad Qazim Bhat and Yongxin Wang and Zutao Jiang and Mingkai Deng and Jinhong Wang and Tianhua Tao and Junbo Li and Haonan Li and Preslav Nakov and Timothy Baldwin and Zhengzhong Liu and Eric P. Xing and Xiaodan Liang and Zhiqiang Shen},
journal={arXiv preprint arXiv:2406.20098},
year={2024}
}@article{zhou2023webarena,
title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
journal={arXiv preprint arXiv:2307.13854},
year={2023}
}@misc{koh2024visualwebarena,
title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
author={Jinkyu Koh and Haifeng Qian and Xingdi Yuan and Alessandro Roncone and Eugene Ie and Yuxiang Huang and Jacky Zhao and Soroush Vosoughi and Jason Liu and Jiaming Shen and others},
year={2024},
eprint={2401.13649},
archivePrefix={arXiv},
primaryClass={cs.AI}
}@article{jimenez2024swe,
title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?},
author={Jimenez, Ekin and Boucher, John and McKelvie, John and Madaan, Aman and Mok, Jerry and Wang, Alex and Jones, Shengding and Gu, Alex and Arora, Abhilasha and Kim, Seonghyeon and others},
journal={arXiv preprint arXiv:2310.06770},
year={2023}
}@misc{li2024swe,
title={SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
author={John Yang and Aman Madaan and Shifeng Zhang and Yuntong Deng and Xinyi Wang and Xueguang Ma and Nathaniel Weir and Ekin Jimenez and Jack Hessel and Kyle Richardson and others},
year={2024},
eprint={2410.03859},
archivePrefix={arXiv},
primaryClass={cs.CL}
}@misc{tencent2025artifactsbench,
title={ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation},
author={Tencent Hunyuan Team},
year={2025},
eprint={2507.04952},
archivePrefix={arXiv},
primaryClass={cs.AI}
}@misc{liu2024llava,
title={LLaVA: Large Language and Vision Assistant},
author={Haotian Liu and Yuanhan Zhang and Liangke Gu and Yuheng Li and Sergio Gámez and Jiawei Liu and Yang Liu and Chunyuan Li and Yong Jae Lee},
year={2024},
eprint={2304.08485},
archivePrefix={arXiv},
primaryClass={cs.CV}
}@misc{wang2023cogvlm,
title={Visual Expert for Multimodal LLM},
author={Weihan Wang and Qingsong Lv and Wenyi Hong and Ji Qi and Guowei Xu and Ji Zhang and Kai Li and Yuhang Zhou and Ming Liu and Yan Wang and others},
year={2023},
eprint={2305.15360},
archivePrefix={arXiv},
primaryClass={cs.CV}
}@misc{huggingface2024websight,
title={WebSight: A Large-Scale Dataset for Visual Web Understanding},
author={HuggingFace Team},
year={2024},
url={https://huggingface.co/datasets/HuggingFaceM4/WebSight}
}@misc{x2021websrc,
title={WebSRC: A Dataset for Webpage Structure Understanding},
author={X-Lance Team},
year={2021},
url={https://x-lance.github.io/WebSRC/}
}This Google Summer of Code project successfully delivered a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. By integrating insights and methodologies from across the AI research community, we have created an evaluation framework that advances the state-of-the-art in automated web development assessment.