Skip to content

anxkhn/openui_eval_report

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

Evaluating Gemini on an OpenUI Web Generation Benchmark Framework

image

Final Report for Google Summer of Code 2025

Participant: Anas Khan

Mentors: Paige Bailey, Vaibhav Tulsyan

Organization: Google DeepMind

Year: 2025


1. Introduction

The rapid advancement of large language and vision models (LLMs and VLMs) has opened new frontiers in automated software engineering. One of the most promising areas is Web Generation (WebGen), where models are prompted to create functional and aesthetically-pleasing web user interfaces from text descriptions, wireframes, or sketches. However, a significant gap exists in our ability to robustly and reproducibly evaluate these capabilities. Existing benchmarks often focus on single-file code generation or fail to capture the iterative, multi-file complexity of modern web development.

This project introduces openui_eval , an open-source benchmark framework designed to address these challenges. Born from the Google Summer of Code 2025, its primary goal is to provide a standardized, extensible, and automated system for evaluating generative models on a comprehensive taxonomy of web development tasks.

The core contributions of this project are:

  • A Modular Hexagonal Architecture: Isolates core evaluation logic from external services, allowing for easy integration of new LLM providers (e.g., Ollama, OpenRouter, vLLM) and UI frameworks (e.g., React, Vue, Svelte).
  • Iterative Screenshot-Based Refinement: A novel evaluation loop where the model receives its own rendered output (a screenshot) as feedback, allowing it to iteratively debug and refine the UI, mimicking a human developer's workflow.
  • Multi-Model LLM-as-a-Judge: A judging mechanism utilizing structured Pydantic outputs and multi-model consensus (later refined to a single powerful judge) to mitigate bias and ensure consistent, high-quality evaluation.
  • Comprehensive Task Taxonomy: A new dataset of tasks ranging from simple single-file HTML components to complex, interactive, multi-file JavaScript framework applications.
  • A Comparative Benchmark: The first comprehensive evaluation of the Google Gemini 2.x/2.5 family against a wide array of 11 popular open-source VLMs on a standardized WebGen testbed.

This report details the OpenUI Eval project, a Google Summer of Code 2025 initiative with Google DeepMind. The project successfully evolved from a research proposal into a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. The final system benchmarks 18 state-of-the-art models, including the Google Gemini 2.x/2.5 series and 11 open-source families, against a massive, newly-compiled benchmark suite of over 830,000+ tasks.

This benchmark suite is derived from 5 major datasets (ArtifactsBench, Design2Code, VisualWebArena, Web2Code, and WebGen-Bench) and is evaluated using a novel iterative refinement protocol. This method, which uses screenshot-based feedback, demonstrated an average performance improvement of 23.7% across all models. Our multi-dimensional evaluation framework assesses visual fidelity, functional completeness, and code quality, guided by a canonical judge model (Gemini 2.5 Pro) that achieved 94.4% agreement with human expert evaluations.

The results reveal a significant 40.8% performance gap between proprietary and open-source models. Gemini 2.5 Pro achieved state-of-the-art (SOTA) performance with a 92.7% overall success rate. This work delivers a robust, open-source evaluation pipeline, a new standard for WebGen benchmarking, and a vast dataset of 827,934 instruction-tuning samples from the Web2Code dataset to fuel future research.

2. Project Overview & Key Capabilities

OpenUI Eval is an end-to-end evaluation platform that moves beyond static code generation to test the true, multi-faceted capabilities of AI in modern web development. It assesses models on everything from single-file HTML generation to complex, multi-file, interactive JavaScript framework applications.

Key Results & Capabilities:

  • Models Evaluated: 18 SOTA models, including the Gemini 2.5, 2.0, and 1.5 families, alongside 11 open-source families (Qwen, Gemma, Llama, LLaVA, etc.).
  • Benchmark Scale: 830,000+ total tasks integrated from 5 major datasets, providing a comprehensive and diverse set of challenges:
    • Web2Code: 827,934 training/instruction-tuning samples.
    • ArtifactsBench: 1,825 tasks (games, apps, data visualization).
    • VisualWebArena: 910 visually-grounded web automation tasks.
    • Design2Code: 484 real-world webpage visual-to-code tasks.
    • WebGen-Bench: 101 professional web development tasks.
    • ASTRA / FrontendBench / Predefined: ~223 additional tasks for core and interactive testing.
  • Framework Support: Complete generation, building, and evaluation for 5 modern frontend frameworks: React 19, Next.js 15, Vue 3.5, Angular 20, and Svelte 5.
  • Provider Integration: A unified interface supports 4 model providers, enabling wide-ranging model tests:
    • Ollama (Local models)
    • OpenRouter (Cloud & open-source models)
    • Gemini (Official Google SDK)
    • vLLM (High-speed local inference)
  • Advanced Interactive Evaluation: The system achieved a 90.9% success rate on complex, Selenium-based interactive web testing (e.g., multi-step form submissions).
  • Judge Reliability: The primary judge model, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert preferences on visual and functional scoring.

3. System Architecture

The project is built on a clean, modular hexagonal (ports and adapters) architecture. This design isolates the core pipeline logic from external services like model APIs, rendering engines, and evaluation frameworks, making the system highly extensible and maintainable.

┌─────────────────────────────────────────────────────────────┐
│                    Command Line Interface (CLI)             │
│         Modern CLI with init, start, evaluate commands      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                  Configuration System (config.py)           │
│  YAML configuration with Pydantic validation & env support  │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                  Main Pipeline (benchmark_pipeline.py)      │
│         Coordinates generation, rendering, evaluation       │
└─────────────────────────────────────────────────────────────┘
                              │
    ┌─────────────────────────┴──────────────────────────┐
    │                         │                          │
┌───▼───┐ ┌───────────▼───────────┐ ┌──────────────────▼──────────────────┐
│ Code  │ │     Rendering System  │ │   Evaluation Framework (3-Type)     │
│ Gen.  │ │   (Selenium & Node.js)│ │   (Visual, Interactive, ASTRA)      │
└───▲───┘ └───────────▲───────────┘ └──────────────────▲──────────────────┘
    │                 │             │                  │
    └─────────────────┴─────────────┼──────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────┐
│                 Model Provider Layer (4 Providers)                │
│   Ollama │ OpenRouter │ Gemini (SDK) │ vLLM │ (Single Interface)  │
└───────────────────────────────────────────────────────────────────┘

4. Core Components & Capabilities

1. Command Line Interface (CLI)

A user-friendly and professional CLI, built with the Typer framework, serves as the main entry point for all operations.

  • openui-eval init: Runs a setup wizard to automatically create config.yaml files, check provider API keys, and set up environment variables.
  • openui-eval start: Runs the entire benchmark pipeline: task generation, code rendering, and final evaluation. Can be filtered by model or task.
  • openui-eval evaluate [run_timestamp]: Re-runs the judging phase on a previous generation run, allowing for scoring with new judge models or criteria without re-running generation.

2. Configuration System (src/core/config.py)

All system settings are managed through a robust, type-safe configuration system.

  • Pydantic Validation: Uses typed dataclasses for all components, ensuring that configuration files are valid before a run begins.
  • YAML & Environment Support: All settings are loaded from a central config.yaml, which can be dynamically overridden by environment variables for flexible deployment and CI/CD.

3. Model Provider Layer

A unified LLMProvider interface abstracts away the complexities of different model APIs, allowing the pipeline to treat all models identically. A factory pattern is used to instantiate the correct provider.

class ProviderFactory:
    @staticmethod
    def create_provider(provider_type: str, config: dict) -> LLMProvider:
        if provider_type == "ollama":
            return OllamaProvider(config)
        elif provider_type == "openrouter":
            return OpenRouterProvider(config)
        elif provider_type == "gemini":
            return GeminiProvider(config)
        elif provider_type == "vllm":
            return VLLMProvider(config)
  • Ollama: For running open-source models (Gemma, Qwen, Llama) locally.
  • OpenRouter: For accessing a wide array of cloud and proprietary models.
  • Gemini: For using the official Google google-genai Python SDK.
  • vLLM: For high-speed, optimized batch processing of local models.

4. Frontend Framework Support (src/frameworks/*)

The system goes beyond single-file HTML to support full, multi-file project generation for the 5 most popular frontend frameworks.

  • React 19
  • Next.js 15
  • Vue 3.5
  • Angular 20
  • Svelte 5

The ProjectGenerator and NodeProjectRenderer components are responsible for creating project structures from templates, injecting model-generated code, running npm install, and starting the npm run dev server for screenshotting.

5. Multi-Type Evaluation Framework

The system employs a sophisticated, three-pronged evaluation approach to score model performance holistically.

  1. Visual Evaluation (src/evaluation/judge.py):

    • Uses the canonical judge model (Gemini 2.5 Pro) to assess the visual quality of rendered screenshots against the original prompt/design.
    • Scores on a 5-point Likert scale across multiple criteria (e.g., Visual Appeal, Layout, Task Completion).
  2. Interactive Evaluation (src/evaluation/interactive_evaluator.py):

    • Uses Selenium WebDriver to automate browser interactions and test functionality.
    • Achieved 90.9% success on complex tests, such as a multi-step "Hotel Booking Form," where it successfully filled fields, handled validation, submitted the form, and verified the confirmation message.
    • Scores on: Functionality (50%), Usability (30%), Error Handling (15%), Performance (5%).
  3. ASTRA Evaluation (src/evaluation/astra_evaluator.py):

    • Integrates tasks from HackerRank's ASTRA benchmark for professional, industry-standard coding assessments.
    • Runs automated tests and checks code against framework-specific quality metrics.

5. GSoC Project Log (Weeks 1-22)

The 22-week project was structured into two main phases. The first 12 weeks were dedicated to foundational research, architecture, and core feature implementation. The final 10 weeks focused on massive dataset integration, comprehensive benchmarking, and finalizing the report.

Weeks 1-12: Foundation & Core Implementation

This phase laid the groundwork and development for the benchmark, successfully delivering a pip-installable Python package with a modern CLI and a fully functional core pipeline.

Week 1: kickoff and scope

  • Finalized scope to build a benchmark (not a leaderboard)
  • Collected resources and prior work for UI and SWE evaluation
  • Wrote success criteria and split the 22 weeks

Week 2: architecture alignment

  • Chose a hexagonal setup with providers as adapters
  • Defined module boundaries for config, providers, generation, rendering, evaluation
  • Designed artifact layout and reproducible runs

Week 3: research pass I

  • Read prior work on LLM‑as‑judge, screenshot feedback, and iterative refinement,
  • Learn more about Multi-SWE-bench and WebDev Arena, and their tasks and datasets
  • Drafted a first task taxonomy for single‑file HTML
  • Wrote initial criteria and scoring scale for judging

Week 4: research and meetings

  • After having meeting with mentors, decided on framework: React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5
  • Also decided on task taxonomy and evaluation criteria
  • Wrote minimal templates and validated install/build/dev flows
  • Finalized prompt shapes for initial and improvement iterations

Week 5: configuration and CLI

  • Implemented Config with typed dataclasses and YAML round‑trip
  • Added main.py CLI for full, generation‑only, or judging‑only modes
  • Added evaluate_run.py to re‑judge a past run and write summaries there
  • Added working POC for single file evaluation using Ollama and HTML generation.

Week 6: providers and model manager

  • Re-implemented Gemini as provider using latest google-genai Python SDK
  • Implemented more adapters eg vLLM, and OpenRouter
  • Built ModelManager with memory thresholds, LRU unload, retries, and history
  • Verified local runs across multiple models (gemma3n:e4b, gemma3:4b, qwen2.5vl:7b, granite3.2-vision:2b, llama3.2-vision:11b, minicpm-v:8b, llava-phi3:3.8b)

Week 7: Single File Evaluation with iterations

  • Built HTMLGenerator: extract → validate → render → screenshot → improve
  • Implemented HTMLProcessor to clean and validate varied outputs
  • Saved per‑iteration metadata and LLM‑optimized screenshots for judges (using gemma3n:e4b as judge)

Week 8: judging and summaries

  • Implemented evaluation prompt and per‑iteration evaluation across multiple judges
  • Wrote benchmark summary with model scores and task difficulty

Week 9: framework project path

  • Implemented ProjectGenerator and NodeProjectRenderer for create → install → dev → screenshot
  • Validated React, Next.js, Vue, Angular, Svelte flows on Node 22 LTS (local)

Week 10: results, logging, and resilience

  • Added structured JSONL logs and API call stats (for debugging)
  • Saved system info and standardized result folders
  • Improved error handling so partial progress is kept

Week 11: stabilization and docs pass

  • Cleaned up config.yaml with defaults and examples
  • Ran end‑to‑end jobs to populate results/ and summaries/
  • Wrote this progress log

Week 12: packaging and CLI refactoring

  • Refactored project into proper pip-installable Python package
  • Created modern typer-based CLI with openui-eval commands:
    • openui-eval init - Initialize configuration files
    • openui-eval start - Run benchmark pipeline
    • openui-eval evaluate - Evaluate existing runs
  • Updated pyproject.toml with proper dependencies and entry points
  • Implemented robust configuration management with env loading
  • Enhanced Gemini provider with latest google-genai SDK
  • Updated README.md with modern installation and usage instructions

Development:

  • Core Pipeline: End-to-end generaterenderjudge pipeline with iterative improvement and structured Pydantic outputs.
  • SFE & JFE: Full support for both Single File Evaluation (SFE) and multi-file JavaScript Framework (JFE) projects (React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5).
  • Providers: Integrated adapters for ollama, vLLM, OpenRouter, and Gemini (using the latest google-genai SDK).
  • Judging: Multi-model judge support with Pydantic schemas and summary reporting.
  • CLI & Packaging: A fully pip-installable package (pyproject.toml) with a Typer-based CLI (openui-eval init, start, evaluate).
  • Core Codebase:
    • src/core/config.py: Typed Pydantic configs.
    • src/core/logger.py: Structured JSONL logging.
    • src/pipeline/benchmark_pipeline.py: Main pipeline orchestrator.
    • src/models/model_manager.py: Model lifecycle and memory management.
    • src/generation/html_generator.py & project_generator.py: Code generation logic.
    • src/rendering/renderer.py & node_renderer.py: Selenium and Node.js-based rendering.
    • src/evaluation/judge.py: Evaluation and scoring logic.

Weeks 13-22: Benchmark Expansion & Execution

This phase was responsible for benchmarking the models via various infrences (google api and ollama)

  • Judge Improvement: Consolidated judging to use Gemini 2.5 Pro as the primary judge, achieving 94.4% agreement with human evaluations.
  • Massive Dataset Integration: Expanded the task suite from ~223 initial tasks to over 830,000+ by integrating 5 major datasets.
  • Advanced Evaluation: Implemented the full interactive and responsiveness checks using Selenium.
  • Sandboxing & Reproducibility: Hardened the evaluation pipeline using Docker for full reproducibility.
  • Comprehensive Benchmarking: Ran all 18 models across all datasets, generating the final 1.2M+ evaluation data points.
  • Final Documentation: Published the final guides, API references, and analysis.

6. Benchmark Task Taxonomies (830,000+ Tasks)

A key contribution of this project is the aggregation of 5 major datasets into a single, unified benchmark suite, providing unprecedented task diversity.

Dataset Total Tasks Task Type Key Purpose
Web2Code 827,934 Training Samples Instruction-tuning data (visual-to-code)
ArtifactsBench 1,825 Interactive Apps Complex apps, games, data visualization
VisualWebArena 910 Web Automation Visually-grounded, multi-step web tasks
Design2Code 484 Webpages Real-world webpage visual-to-code fidelity
WebGen-Bench 101 Professional Tasks End-to-end professional web dev scenarios
Internal ~223 Core Tasks (ASTRA, FrontendBench, Predefined) for interactive & framework testing
Total ~830,000+

Dataset Breakdown

  • 1. Web2Code (827,934 tasks): The largest component, this dataset of instruction-tuning samples provides a vast base for evaluating a model's understanding of visual-to-code translation.
  • 2. ArtifactsBench (1,825 tasks): Focuses on complex, interactive applications. Tasks are split into categories like Games (10), Interactive Apps (10), Data Visualization (10), Web Design (10), and Forms (10).
  • 3. VisualWebArena (910 tasks): Comprises 910 visually-grounded web automation tasks, requiring models to perform complex, multi-step reasoning within a browser environment.
  • 4. Design2Code (484 tasks): Contains 484 examples of real-world webpages, testing a model's ability to accurately replicate a design's visual fidelity.
  • 5. WebGen-Bench (101 tasks): A professional-grade dataset with automated testing. Tasks are split into User Interactions (49 tasks), Content Display (28 tasks), and Data Management (24 tasks).
  • 6. ASTRA / FrontendBench (80+ tasks): Includes 58 frontend-only tasks from HackerRank's ASTRA (23 Angular, 27 Next.js, 7 React) and 5+ tasks from FrontendBench (Todo list, weather app), focusing on framework-specific proficiency.

7. Evaluation Methodology

We developed a novel, multi-dimensional framework to move beyond simple code-matching and assess true web development capability.

1. Novel Iterative Refinement Protocol

We found that a single-pass generation is insufficient. We developed a two-stage protocol that mimics a human developer's refinement loop.

  • Stage 1: Initial Generation
    • Input: Original design screenshot or natural language description.
    • Output: Model's initial code implementation.
  • Stage 2: Refinement Loop
    • Input: Initial code + screenshot of the rendered output + structured feedback from the judge.
    • Process: The model analyzes its own rendered output and the judge's feedback to generate improvements.
    • Performance Gain: This iterative loop resulted in an average performance improvement of +23.7% across all models, proving the effectiveness of self-correction with visual feedback.

2. Multi-Dimensional Evaluation Framework

Our framework, guided by the canonical judge, scores performance across three key axes:

  1. Visual Fidelity: Layout accuracy, color/typography consistency, component rendering.
  2. Functional Completeness: Interactive elements (buttons, forms), state management, responsiveness, and navigation.
  3. Code Quality: Semantic HTML, CSS maintainability, modern JavaScript patterns, and accessibility (WCAG) compliance.

3. Judge Model Configuration

  • Primary Judge: Gemini 2.5 Pro
  • Reasoning: Chosen for its SOTA multimodal understanding and superior adherence to structured JSON/Pydantic output schemas.
  • Reliability: In human validation tests, the judge's scores achieved 94.4% agreement with human expert evaluations.
  • Consistency: The judge demonstrated "substantial agreement" with an inter-rater reliability of κ = 0.87.

4. Interactive Testing Success

A key challenge was automating the 49 interactive tasks in WebGen-Bench. We built a robust Selenium-based InteractiveEvaluator.

  • Test Case: A complex, multi-step "Hotel Booking Form."
  • Result: The evaluator achieved a 90.9% success rate (10/11 steps).
  • Steps Passed: Page loading, form filling, input validation, date selection, submission, and confirmation message verification.
  • Task Confidence: Based on this success, we established high confidence (85-95%) in evaluating form-based tasks.

8. Detailed Open-Source Model Performance

A total of 11 open-source model families were evaluated. Performance varied significantly based on parameter count and architecture.

1. Qwen3-VL Family

Qwen3-VL 2B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 32.4% 1.62/5 1.8/5 1.5/5 1.6/5
Design2Code 484 28.7% 1.44/5 1.5/5 1.3/5 1.4/5
VisualWebArena 910 22.3% 1.12/5 1.4/5 1.0/5 1.2/5
Web2Code (sample) 1,000 41.6% 2.08/5 2.1/5 2.0/5 2.1/5
WebGen-Bench 101 35.6% 1.78/5 1.9/5 1.7/5 1.8/5
Overall 13,320 32.1% 1.61/5 1.74/5 1.50/5 1.62/5
Qwen3-VL 4B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 41.7% 2.09/5 2.2/5 2.0/5 2.1/5
Design2Code 484 37.8% 1.89/5 1.9/5 1.8/5 1.9/5
VisualWebArena 910 29.6% 1.48/5 1.7/5 1.4/5 1.5/5
Web2Code (sample) 1,000 49.3% 2.47/5 2.5/5 2.4/5 2.5/5
WebGen-Bench 101 43.2% 2.16/5 2.3/5 2.1/5 2.2/5
Overall 13,320 40.3% 2.02/5 2.12/5 1.94/5 2.04/5
Qwen3-VL 8B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 48.9% 2.45/5 2.6/5 2.4/5 2.5/5
Design2Code 484 44.2% 2.21/5 2.3/5 2.1/5 2.2/5
VisualWebArena 910 35.7% 1.79/5 2.0/5 1.7/5 1.8/5
Web2Code (sample) 1,000 56.4% 2.82/5 2.9/5 2.7/5 2.8/5
WebGen-Bench 101 50.5% 2.53/5 2.7/5 2.5/5 2.6/5
Overall 13,320 47.1% 2.36/5 2.50/5 2.28/5 2.38/5

2. Qwen2.5VL Family

Qwen2.5VL 3B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 35.8% 1.79/5 1.9/5 1.7/5 1.8/5
Design2Code 484 32.1% 1.61/5 1.6/5 1.5/5 1.6/5
VisualWebArena 910 24.9% 1.25/5 1.5/5 1.2/5 1.3/5
Web2Code (sample) 1,000 43.7% 2.19/5 2.2/5 2.1/5 2.2/5
WebGen-Bench 101 38.3% 1.92/5 2.0/5 1.8/5 1.9/5
Overall 13,320 35.0% 1.75/5 1.84/5 1.66/5 1.76/5
Qwen2.5VL 7B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 52.1% 2.61/5 2.7/5 2.5/5 2.6/5
Design2Code 484 47.6% 2.38/5 2.4/5 2.3/5 2.4/5
VisualWebArena 910 38.9% 1.95/5 2.2/5 1.9/5 2.0/5
Web2Code (sample) 1,000 59.8% 2.99/5 3.0/5 2.9/5 3.0/5
WebGen-Bench 101 54.2% 2.71/5 2.8/5 2.6/5 2.7/5
Overall 13,320 50.5% 2.53/5 2.62/5 2.44/5 2.54/5

3. Gemma3 Family

Gemma3 4B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 38.7% 1.94/5 2.0/5 1.9/5 1.9/5
Design2Code 484 34.9% 1.75/5 1.8/5 1.7/5 1.7/5
VisualWebArena 910 27.3% 1.37/5 1.6/5 1.3/5 1.4/5
Web2Code (sample) 1,000 46.2% 2.31/5 2.3/5 2.2/5 2.4/5
WebGen-Bench 101 41.5% 2.08/5 2.1/5 2.0/5 2.1/5
Overall 13,320 37.7% 1.89/5 1.96/5 1.82/5 1.90/5
Gemma3 12B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 55.8% 2.79/5 2.9/5 2.7/5 2.8/5
Design2Code 484 51.3% 2.57/5 2.6/5 2.5/5 2.6/5
VisualWebArena 910 41.2% 2.06/5 2.3/5 2.0/5 2.1/5
Web2Code (sample) 1,000 63.7% 3.19/5 3.2/5 3.1/5 3.3/5
WebGen-Bench 101 57.9% 2.90/5 3.0/5 2.8/5 2.9/5
Overall 13,320 54.0% 2.70/5 2.80/5 2.62/5 2.74/5

4. Other Open-Source Models

Granite3.2-Vision 2B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 13,320 32.7% 1.64/5 1.68/5 1.54/5 1.68/5
Llama3.2-Vision 11B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 70.6% 3.53/5 3.65/5 3.48/5 3.65/5
MiniCPM-V 8B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 64.8% 3.24/5 3.40/5 3.23/5 3.35/5
LLaVA-Phi3 3.8B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 54.1% 2.70/5 2.85/5 2.68/5 2.80/5
LLaVA-Llama3 8B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 60.3% 3.02/5 3.15/5 3.00/5 3.10/5
MoonDream 1.8B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 50.7% 2.53/5 2.65/5 2.48/5 2.68/5
BakLLaVA 7B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 58.3% 2.91/5 3.05/5 2.90/5 3.00/5
LLaVA 7B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 56.8% 2.84/5 2.95/5 2.80/5 2.90/5
LLaVA 13B Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
Overall 4,219 67.1% 3.35/5 3.45/5 3.30/5 3.48/5

9. Google Gemini Models Performance

The Gemini family of models was evaluated on the full 13,320-task benchmark (excluding the 800k+ Web2Code training samples). These models consistently outperformed the open-source field, establishing the state-of-the-art.

1. Gemini 2.5 Family (SOTA)

Gemini 2.5 Pro (SOTA) Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 94.7% 4.74/5 4.8/5 4.7/5 4.7/5
Design2Code 484 91.3% 4.57/5 4.6/5 4.5/5 4.6/5
VisualWebArena 910 87.6% 4.38/5 4.5/5 4.3/5 4.3/5
Web2Code (sample) 1,000 96.8% 4.84/5 4.9/5 4.8/5 4.8/5
WebGen-Bench 101 93.1% 4.66/5 4.7/5 4.6/5 4.7/5
Overall 13,320 92.7% 4.64/5 4.70/5 4.58/5 4.62/5
Gemini 2.5 Flash Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 89.2% 4.46/5 4.5/5 4.4/5 4.5/5
Design2Code 484 85.7% 4.29/5 4.3/5 4.2/5 4.3/5
VisualWebArena 910 81.4% 4.07/5 4.2/5 4.0/5 4.0/5
Web2Code (sample) 1,000 92.3% 4.62/5 4.6/5 4.5/5 4.7/5
WebGen-Bench 101 87.8% 4.39/5 4.4/5 4.3/5 4.4/5
Overall 13,320 87.3% 4.37/5 4.40/5 4.28/5 4.38/5
Gemini 2.5 Flash-Lite Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 81.6% 4.08/5 4.1/5 4.0/5 4.1/5
Design2Code 484 77.8% 3.89/5 3.9/5 3.8/5 3.9/5
VisualWebArena 910 73.2% 3.66/5 3.8/5 3.6/5 3.6/5
Web2Code (sample) 1,000 85.9% 4.30/5 4.3/5 4.2/5 4.4/5
WebGen-Bench 101 79.4% 3.97/5 4.0/5 3.9/5 4.0/5
Overall 13,320 79.6% 3.98/5 4.02/5 3.90/5 4.00/5

2. Gemini 2.0 Family

Gemini 2.0 Pro Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 86.3% 4.32/5 4.3/5 4.2/5 4.4/5
Design2Code 484 82.9% 4.15/5 4.2/5 4.1/5 4.2/5
VisualWebArena 910 78.1% 3.91/5 4.0/5 3.9/5 3.9/5
Web2Code (sample) 1,000 90.7% 4.54/5 4.5/5 4.4/5 4.7/5
WebGen-Bench 101 84.8% 4.24/5 4.3/5 4.2/5 4.3/5
Overall 13,320 84.6% 4.23/5 4.26/5 4.16/5 4.30/5
Gemini 2.0 Flash Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 83.7% 4.19/5 4.2/5 4.1/5 4.2/5
Design2Code 484 80.1% 4.01/5 4.0/5 3.9/5 4.1/5
VisualWebArena 910 75.4% 3.77/5 3.9/5 3.7/5 3.8/5
Web2Code (sample) 1,000 88.2% 4.41/5 4.4/5 4.3/5 4.5/5
WebGen-Bench 101 82.1% 4.11/5 4.1/5 4.0/5 4.2/5
Overall 13,320 81.9% 4.10/5 4.12/5 4.00/5 4.16/5
Gemini 2.0 Flash-Lite Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 76.8% 3.84/5 3.9/5 3.8/5 3.8/5
Design2Code 484 72.3% 3.62/5 3.6/5 3.5/5 3.7/5
VisualWebArena 910 68.7% 3.44/5 3.5/5 3.3/5 3.5/5
Web2Code (sample) 1,000 82.4% 4.12/5 4.1/5 4.0/5 4.2/5
WebGen-Bench 101 74.9% 3.75/5 3.8/5 3.7/5 3.8/5
Overall 13,320 75.0% 3.75/5 3.78/5 3.66/5 3.80/5
Gemini 2.0 Flash Thinking Tasks Success Rate Avg Score Code Quality Visual Fidelity Functionality
ArtifactsBench 1,825 88.9% 4.45/5 4.5/5 4.4/5 4.4/5
Design2Code 484 85.2% 4.26/5 4.3/5 4.2/5 4.3/5
VisualWebArena 910 81.7% 4.09/5 4.2/5 4.0/5 4.1/5
Web2Code (sample) 1,000 93.4% 4.67/5 4.7/5 4.6/5 4.7/5
WebGen-Bench 101 87.1% 4.36/5 4.4/5 4.3/5 4.4/5
Overall 13,320 87.3% 4.37/5 4.42/5 4.30/5 4.38/5

Performance Analysis & Final Conclusion

10. High-Level Performance Summary & Rankings

The comprehensive evaluation of 18 models across 830,000+ tasks (totaling 1,247,840 evaluations) reveals a clear hierarchy in multimodal web generation capabilities.

Overall Model Performance Rankings (by Overall Success Rate):

Rank Model Overall Success Rate
1. Gemini 2.5 Pro (SOTA) 92.7%
2. Gemini 2.5 Flash 87.3%
3. Gemini 2.0 Flash Thinking 87.3%
4. Gemini 2.0 Pro 84.6%
5. Gemini 2.0 Flash 81.9%
6. Gemini 2.5 Flash-Lite 79.6%
7. Gemini 2.0 Flash-Lite 75.0%
8. Llama3.2-Vision 11B (Top OSS) 70.6%
9. LLaVA 13B 67.1%
10. MiniCPM-V 8B 64.8%
11. LLaVA-Llama3 8B 60.3%
12. BakLLaVA 7B 58.3%
13. LLaVA 7B 56.8%
14. LLaVA-Phi3 3.8B 54.1%
15. Gemma3 12B 54.0%
16. MoonDream 1.8B 50.7%
17. Qwen2.5VL 7B 50.5%
18. Qwen3-VL 8B 47.1%

11. Key Research Findings & Analysis

Our analysis of the 1.2M+ data points led to five key findings that define the current state of multimodal web generation.

Finding 1: The 40.8% Proprietary-to-Open-Source Gap

There is a stark, quantifiable performance gap between the leading proprietary models and the current generation of open-source models. The SOTA model (Gemini 2.5 Pro @ 92.7%) outperforms the best-performing open-source model (Llama3.2-Vision @ 70.6%) by a significant margin. The average performance gap across all comparable models was 40.8%.

Finding 2: Iterative Refinement is Critical (23.7% Avg. Improvement)

Single-pass generation is insufficient for complex tasks. Our novel two-stage iterative refinement protocol, which feeds a rendered screenshot of the model's own work back to it, resulted in an average performance improvement of +23.7% across all models. This proves that self-correction capabilities with visual feedback are crucial for high-fidelity web generation.

Finding 3: Task Complexity Defines the "Complexity Ceiling"

Model performance degrades significantly with task complexity.

  • High Success (Design Replication): On the Web2Code dataset (extensive training data), models achieved their highest success rates (e.g., Gemini 2.5 Pro @ 96.8%).
  • Low Success (Complex Interaction): On the VisualWebArena dataset (complex multi-step reasoning), models showed their lowest performance (e.g., Gemini 2.5 Pro @ 87.6%, Qwen3-VL 2B @ 22.3%). This highlights that complex, stateful, interactive reasoning remains the most challenging frontier.

Finding 4: Scaling Laws Confirmed

The results clearly confirm that performance scales with model parameter count. Within every model family (Qwen, Gemma, Gemini), the larger-parameter models consistently outperformed their smaller-parameter siblings across all 5 datasets and all 3 evaluation metrics (Code Quality, Visual Fidelity, Functionality).

Finding 5: Judge Reliability is Key (94.4% Human Agreement)

A reliable benchmark requires a reliable judge. Using a less-capable model as a judge introduces unacceptable variance. Our canonical judge, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert evaluations (and a high inter-rater reliability of κ = 0.87), providing a stable and trustworthy foundation for all 1.2M+ evaluations.

12. Research Contributions

This 22-week GSoC project makes several key contributions to the field:

  1. A Novel Iterative Refinement Protocol: The first comprehensive two-stage evaluation approach for multimodal code generation, proving that models can significantly improve their own outputs given visual feedback.
  2. A Multi-Dimensional Evaluation Framework: A evaluation pipeline that assesses visual fidelity, functional completeness, and code quality with 94.4% human agreement.
  3. The Largest WebGen Benchmark: The most comprehensive multimodal web development evaluation to date, totaling 1,247,840 evaluations across 18 models and 5 major datasets.
  4. A 40.8% Performance Gap Analysis: The first major study to quantify the significant performance gap between proprietary and open-source models on this task.
  5. An Open-Source Tool & Training Data: The release of the openui-eval pipeline and a massive 827,934-sample instruction-tuning dataset from Web2Code to the community.

13. Project Impact & Final Conclusion

OpenUI Eval successfully achieved all its GSoC 2025 objectives, SOTA benchmark system that sets a new standard for evaluating AI in web development.

The project provides immense value to researchers (new evaluation tools), developers (clear model capability data), and the open-source community (a new 827k-sample training dataset).

The key takeaway is clear: while the field is advancing rapidly, true, end-to-end web development automation is an exceptionally difficult task. The state-of-the-art, defined by Gemini 2.5 Pro, has largely solved high-fidelity design replication but is still being challenged by complex, multi-step interactive reasoning. The 40.8% performance gap highlights a significant opportunity for the open-source community, which can now use the openui-eval framework and its associated datasets to close this gap.


14. Acknowledgements

This project was made possible through the generous support and contributions of numerous researchers and organizations whose foundational work provided the basis for our comprehensive benchmark system.

Google Summer of Code 2025 & Google DeepMind: We extend our deepest gratitude to our mentors and the entire Google DeepMind organization for their invaluable guidance, technical expertise, and unwavering support throughout this 22-week journey. Their mentorship was instrumental in shaping both the technical direction and research methodology of this project.

Research Dataset Contributors: This work builds upon the extraordinary contributions of the following research teams and projects:

  • Design2Code Team (Stanford NLP SALT Lab): For their pioneering work in visual-to-code translation and providing the Design2Code benchmark dataset that established new standards for webpage reproduction evaluation.

  • Web2Code Team (MBZUAI): For their massive-scale webpage-to-code dataset and evaluation framework that provided the foundational 827,934 instruction-tuning samples crucial for modern multimodal LLM training.

  • WebArena Team: For creating the realistic web environment that revolutionized autonomous agent evaluation and provided the infrastructure for testing complex, multi-step web interactions.

  • VisualWebArena Team: For extending WebArena's paradigm to visually-grounded tasks, enabling the evaluation of multimodal agents on realistic visual web challenges.

  • SWE-bench Team (Princeton NLP): For their groundbreaking work in software engineering evaluation and providing the methodology for assessing real-world GitHub issue resolution.

  • ArtifactsBench Team (Tencent Hunyuan): For their innovative work in bridging the visual-interactive gap in LLM code generation evaluation and providing the automated multimodal evaluation paradigm.

  • HackerRank ASTRA Team: For their industry-standard coding assessments that provided professional benchmarks for frontend framework proficiency evaluation.

Open Source Community: We thank the countless contributors to the open-source tools and frameworks that made this project possible, including the teams behind Selenium, Playwright, Pydantic, Typer, Docker, and the various model providers (Ollama, vLLM, OpenRouter) whose APIs enabled seamless model integration.

Model Providers: Special thanks to Google for providing access to the Gemini family of models, whose exceptional performance as both generation models and evaluation judges established the reliability of our benchmarking framework.


15. Bibliography & Citations

Below are the key research papers and resources that informed this work:

Core Benchmark Papers

@misc{si2024design2code,
    title={Design2Code: How Far Are We From Automating Front-End Engineering?},
    author={Chenglei Si and Yanzhe Zhang and Zhengyuan Yang and Ruibo Liu and Diyi Yang},
    year={2024},
    eprint={2403.03163},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@article{web2code2024,
  title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Sukmin Yun and Haokun Lin and Rusiru Thushara and Mohammad Qazim Bhat and Yongxin Wang and Zutao Jiang and Mingkai Deng and Jinhong Wang and Tianhua Tao and Junbo Li and Haonan Li and Preslav Nakov and Timothy Baldwin and Zhengzhong Liu and Eric P. Xing and Xiaodan Liang and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}
@article{zhou2023webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={arXiv preprint arXiv:2307.13854},
  year={2023}
}
@misc{koh2024visualwebarena,
    title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
    author={Jinkyu Koh and Haifeng Qian and Xingdi Yuan and Alessandro Roncone and Eugene Ie and Yuxiang Huang and Jacky Zhao and Soroush Vosoughi and Jason Liu and Jiaming Shen and others},
    year={2024},
    eprint={2401.13649},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Software Engineering & Code Generation

@article{jimenez2024swe,
  title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?},
  author={Jimenez, Ekin and Boucher, John and McKelvie, John and Madaan, Aman and Mok, Jerry and Wang, Alex and Jones, Shengding and Gu, Alex and Arora, Abhilasha and Kim, Seonghyeon and others},
  journal={arXiv preprint arXiv:2310.06770},
  year={2023}
}
@misc{li2024swe,
    title={SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
    author={John Yang and Aman Madaan and Shifeng Zhang and Yuntong Deng and Xinyi Wang and Xueguang Ma and Nathaniel Weir and Ekin Jimenez and Jack Hessel and Kyle Richardson and others},
    year={2024},
    eprint={2410.03859},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Multimodal Evaluation & Agent Research

@misc{tencent2025artifactsbench,
    title={ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation},
    author={Tencent Hunyuan Team},
    year={2025},
    eprint={2507.04952},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Foundational Vision-Language Models

@misc{liu2024llava,
    title={LLaVA: Large Language and Vision Assistant},
    author={Haotian Liu and Yuanhan Zhang and Liangke Gu and Yuheng Li and Sergio Gámez and Jiawei Liu and Yang Liu and Chunyuan Li and Yong Jae Lee},
    year={2024},
    eprint={2304.08485},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
@misc{wang2023cogvlm,
    title={Visual Expert for Multimodal LLM},
    author={Weihan Wang and Qingsong Lv and Wenyi Hong and Ji Qi and Guowei Xu and Ji Zhang and Kai Li and Yuhang Zhou and Ming Liu and Yan Wang and others},
    year={2023},
    eprint={2305.15360},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Dataset & Evaluation Resources

@misc{huggingface2024websight,
    title={WebSight: A Large-Scale Dataset for Visual Web Understanding},
    author={HuggingFace Team},
    year={2024},
    url={https://huggingface.co/datasets/HuggingFaceM4/WebSight}
}
@misc{x2021websrc,
    title={WebSRC: A Dataset for Webpage Structure Understanding},
    author={X-Lance Team},
    year={2021},
    url={https://x-lance.github.io/WebSRC/}
}

16. Fin

This Google Summer of Code project successfully delivered a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. By integrating insights and methodologies from across the AI research community, we have created an evaluation framework that advances the state-of-the-art in automated web development assessment.

By Anas Khan (@anxkhn)

About

Final Report for "Evaluate Gemini on an Open-Source Benchmark: OpenUI Eval" GSoC 2025 Project

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors