Evaluating Gemini on an OpenUI Web Generation Benchmark Framework

Final Report for Google Summer of Code 2025

Participant: Anas Khan

Mentors: Paige Bailey, Vaibhav Tulsyan

Organization: Google DeepMind

Year: 2025

1. Introduction

The rapid advancement of large language and vision models (LLMs and VLMs) has opened new frontiers in automated software engineering. One of the most promising areas is Web Generation (WebGen), where models are prompted to create functional and aesthetically-pleasing web user interfaces from text descriptions, wireframes, or sketches. However, a significant gap exists in our ability to robustly and reproducibly evaluate these capabilities. Existing benchmarks often focus on single-file code generation or fail to capture the iterative, multi-file complexity of modern web development.

This project introduces openui_eval , an open-source benchmark framework designed to address these challenges. Born from the Google Summer of Code 2025, its primary goal is to provide a standardized, extensible, and automated system for evaluating generative models on a comprehensive taxonomy of web development tasks.

The core contributions of this project are:

A Modular Hexagonal Architecture: Isolates core evaluation logic from external services, allowing for easy integration of new LLM providers (e.g., Ollama, OpenRouter, vLLM) and UI frameworks (e.g., React, Vue, Svelte).
Iterative Screenshot-Based Refinement: A novel evaluation loop where the model receives its own rendered output (a screenshot) as feedback, allowing it to iteratively debug and refine the UI, mimicking a human developer's workflow.
Multi-Model LLM-as-a-Judge: A judging mechanism utilizing structured Pydantic outputs and multi-model consensus (later refined to a single powerful judge) to mitigate bias and ensure consistent, high-quality evaluation.
Comprehensive Task Taxonomy: A new dataset of tasks ranging from simple single-file HTML components to complex, interactive, multi-file JavaScript framework applications.
A Comparative Benchmark: The first comprehensive evaluation of the Google Gemini 2.x/2.5 family against a wide array of 11 popular open-source VLMs on a standardized WebGen testbed.

This report details the OpenUI Eval project, a Google Summer of Code 2025 initiative with Google DeepMind. The project successfully evolved from a research proposal into a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. The final system benchmarks 18 state-of-the-art models, including the Google Gemini 2.x/2.5 series and 11 open-source families, against a massive, newly-compiled benchmark suite of over 830,000+ tasks.

This benchmark suite is derived from 5 major datasets (ArtifactsBench, Design2Code, VisualWebArena, Web2Code, and WebGen-Bench) and is evaluated using a novel iterative refinement protocol. This method, which uses screenshot-based feedback, demonstrated an average performance improvement of 23.7% across all models. Our multi-dimensional evaluation framework assesses visual fidelity, functional completeness, and code quality, guided by a canonical judge model (Gemini 2.5 Pro) that achieved 94.4% agreement with human expert evaluations.

The results reveal a significant 40.8% performance gap between proprietary and open-source models. Gemini 2.5 Pro achieved state-of-the-art (SOTA) performance with a 92.7% overall success rate. This work delivers a robust, open-source evaluation pipeline, a new standard for WebGen benchmarking, and a vast dataset of 827,934 instruction-tuning samples from the Web2Code dataset to fuel future research.

2. Project Overview & Key Capabilities

OpenUI Eval is an end-to-end evaluation platform that moves beyond static code generation to test the true, multi-faceted capabilities of AI in modern web development. It assesses models on everything from single-file HTML generation to complex, multi-file, interactive JavaScript framework applications.

Key Results & Capabilities:

Models Evaluated: 18 SOTA models, including the Gemini 2.5, 2.0, and 1.5 families, alongside 11 open-source families (Qwen, Gemma, Llama, LLaVA, etc.).
Benchmark Scale: 830,000+ total tasks integrated from 5 major datasets, providing a comprehensive and diverse set of challenges:
- Web2Code: 827,934 training/instruction-tuning samples.
- ArtifactsBench: 1,825 tasks (games, apps, data visualization).
- VisualWebArena: 910 visually-grounded web automation tasks.
- Design2Code: 484 real-world webpage visual-to-code tasks.
- WebGen-Bench: 101 professional web development tasks.
- ASTRA / FrontendBench / Predefined: ~223 additional tasks for core and interactive testing.
Framework Support: Complete generation, building, and evaluation for 5 modern frontend frameworks: React 19, Next.js 15, Vue 3.5, Angular 20, and Svelte 5.
Provider Integration: A unified interface supports 4 model providers, enabling wide-ranging model tests:
- Ollama (Local models)
- OpenRouter (Cloud & open-source models)
- Gemini (Official Google SDK)
- vLLM (High-speed local inference)
Advanced Interactive Evaluation: The system achieved a 90.9% success rate on complex, Selenium-based interactive web testing (e.g., multi-step form submissions).
Judge Reliability: The primary judge model, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert preferences on visual and functional scoring.

3. System Architecture

The project is built on a clean, modular hexagonal (ports and adapters) architecture. This design isolates the core pipeline logic from external services like model APIs, rendering engines, and evaluation frameworks, making the system highly extensible and maintainable.

┌─────────────────────────────────────────────────────────────┐
│                    Command Line Interface (CLI)             │
│         Modern CLI with init, start, evaluate commands      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                  Configuration System (config.py)           │
│  YAML configuration with Pydantic validation & env support  │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                  Main Pipeline (benchmark_pipeline.py)      │
│         Coordinates generation, rendering, evaluation       │
└─────────────────────────────────────────────────────────────┘
                              │
    ┌─────────────────────────┴──────────────────────────┐
    │                         │                          │
┌───▼───┐ ┌───────────▼───────────┐ ┌──────────────────▼──────────────────┐
│ Code  │ │     Rendering System  │ │   Evaluation Framework (3-Type)     │
│ Gen.  │ │   (Selenium & Node.js)│ │   (Visual, Interactive, ASTRA)      │
└───▲───┘ └───────────▲───────────┘ └──────────────────▲──────────────────┘
    │                 │             │                  │
    └─────────────────┴─────────────┼──────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────┐
│                 Model Provider Layer (4 Providers)                │
│   Ollama │ OpenRouter │ Gemini (SDK) │ vLLM │ (Single Interface)  │
└───────────────────────────────────────────────────────────────────┘

4. Core Components & Capabilities

1. Command Line Interface (CLI)

A user-friendly and professional CLI, built with the Typer framework, serves as the main entry point for all operations.

openui-eval init: Runs a setup wizard to automatically create config.yaml files, check provider API keys, and set up environment variables.
openui-eval start: Runs the entire benchmark pipeline: task generation, code rendering, and final evaluation. Can be filtered by model or task.
openui-eval evaluate [run_timestamp]: Re-runs the judging phase on a previous generation run, allowing for scoring with new judge models or criteria without re-running generation.

2. Configuration System (`src/core/config.py`)

All system settings are managed through a robust, type-safe configuration system.

Pydantic Validation: Uses typed dataclasses for all components, ensuring that configuration files are valid before a run begins.
YAML & Environment Support: All settings are loaded from a central config.yaml, which can be dynamically overridden by environment variables for flexible deployment and CI/CD.

3. Model Provider Layer

A unified LLMProvider interface abstracts away the complexities of different model APIs, allowing the pipeline to treat all models identically. A factory pattern is used to instantiate the correct provider.

class ProviderFactory:
    @staticmethod
    def create_provider(provider_type: str, config: dict) -> LLMProvider:
        if provider_type == "ollama":
            return OllamaProvider(config)
        elif provider_type == "openrouter":
            return OpenRouterProvider(config)
        elif provider_type == "gemini":
            return GeminiProvider(config)
        elif provider_type == "vllm":
            return VLLMProvider(config)

Ollama: For running open-source models (Gemma, Qwen, Llama) locally.
OpenRouter: For accessing a wide array of cloud and proprietary models.
Gemini: For using the official Google google-genai Python SDK.
vLLM: For high-speed, optimized batch processing of local models.

4. Frontend Framework Support (`src/frameworks/*`)

The system goes beyond single-file HTML to support full, multi-file project generation for the 5 most popular frontend frameworks.

React 19
Next.js 15
Vue 3.5
Angular 20
Svelte 5

The ProjectGenerator and NodeProjectRenderer components are responsible for creating project structures from templates, injecting model-generated code, running npm install, and starting the npm run dev server for screenshotting.

5. Multi-Type Evaluation Framework

The system employs a sophisticated, three-pronged evaluation approach to score model performance holistically.

Visual Evaluation (src/evaluation/judge.py):
- Uses the canonical judge model (Gemini 2.5 Pro) to assess the visual quality of rendered screenshots against the original prompt/design.
- Scores on a 5-point Likert scale across multiple criteria (e.g., Visual Appeal, Layout, Task Completion).
Interactive Evaluation (src/evaluation/interactive_evaluator.py):
- Uses Selenium WebDriver to automate browser interactions and test functionality.
- Achieved 90.9% success on complex tests, such as a multi-step "Hotel Booking Form," where it successfully filled fields, handled validation, submitted the form, and verified the confirmation message.
- Scores on: Functionality (50%), Usability (30%), Error Handling (15%), Performance (5%).
ASTRA Evaluation (src/evaluation/astra_evaluator.py):
- Integrates tasks from HackerRank's ASTRA benchmark for professional, industry-standard coding assessments.
- Runs automated tests and checks code against framework-specific quality metrics.

5. GSoC Project Log (Weeks 1-22)

The 22-week project was structured into two main phases. The first 12 weeks were dedicated to foundational research, architecture, and core feature implementation. The final 10 weeks focused on massive dataset integration, comprehensive benchmarking, and finalizing the report.

Weeks 1-12: Foundation & Core Implementation

This phase laid the groundwork and development for the benchmark, successfully delivering a pip-installable Python package with a modern CLI and a fully functional core pipeline.

Week 1: kickoff and scope

Finalized scope to build a benchmark (not a leaderboard)
Collected resources and prior work for UI and SWE evaluation
Wrote success criteria and split the 22 weeks

Week 2: architecture alignment

Chose a hexagonal setup with providers as adapters
Defined module boundaries for config, providers, generation, rendering, evaluation
Designed artifact layout and reproducible runs

Week 3: research pass I

Read prior work on LLM‑as‑judge, screenshot feedback, and iterative refinement,
Learn more about Multi-SWE-bench and WebDev Arena, and their tasks and datasets
Drafted a first task taxonomy for single‑file HTML
Wrote initial criteria and scoring scale for judging

Week 4: research and meetings

After having meeting with mentors, decided on framework: React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5
Also decided on task taxonomy and evaluation criteria
Wrote minimal templates and validated install/build/dev flows
Finalized prompt shapes for initial and improvement iterations

Week 5: configuration and CLI

Implemented Config with typed dataclasses and YAML round‑trip
Added main.py CLI for full, generation‑only, or judging‑only modes
Added evaluate_run.py to re‑judge a past run and write summaries there
Added working POC for single file evaluation using Ollama and HTML generation.

Week 6: providers and model manager

Re-implemented Gemini as provider using latest google-genai Python SDK
Implemented more adapters eg vLLM, and OpenRouter
Built ModelManager with memory thresholds, LRU unload, retries, and history
Verified local runs across multiple models (gemma3n:e4b, gemma3:4b, qwen2.5vl:7b, granite3.2-vision:2b, llama3.2-vision:11b, minicpm-v:8b, llava-phi3:3.8b)

Week 7: Single File Evaluation with iterations

Built HTMLGenerator: extract → validate → render → screenshot → improve
Implemented HTMLProcessor to clean and validate varied outputs
Saved per‑iteration metadata and LLM‑optimized screenshots for judges (using gemma3n:e4b as judge)

Week 8: judging and summaries

Implemented evaluation prompt and per‑iteration evaluation across multiple judges
Wrote benchmark summary with model scores and task difficulty

Week 9: framework project path

Implemented ProjectGenerator and NodeProjectRenderer for create → install → dev → screenshot
Validated React, Next.js, Vue, Angular, Svelte flows on Node 22 LTS (local)

Week 10: results, logging, and resilience

Added structured JSONL logs and API call stats (for debugging)
Saved system info and standardized result folders
Improved error handling so partial progress is kept

Week 11: stabilization and docs pass

Cleaned up config.yaml with defaults and examples
Ran end‑to‑end jobs to populate results/ and summaries/
Wrote this progress log

Week 12: packaging and CLI refactoring

Refactored project into proper pip-installable Python package
Created modern typer-based CLI with openui-eval commands:
- openui-eval init - Initialize configuration files
- openui-eval start - Run benchmark pipeline
- openui-eval evaluate - Evaluate existing runs
Updated pyproject.toml with proper dependencies and entry points
Implemented robust configuration management with env loading
Enhanced Gemini provider with latest google-genai SDK
Updated README.md with modern installation and usage instructions

Development:

Core Pipeline: End-to-end generate → render → judge pipeline with iterative improvement and structured Pydantic outputs.
SFE & JFE: Full support for both Single File Evaluation (SFE) and multi-file JavaScript Framework (JFE) projects (React 19, Next.js 15, Vue 3.5, Angular 20, Svelte 5).
Providers: Integrated adapters for ollama, vLLM, OpenRouter, and Gemini (using the latest google-genai SDK).
Judging: Multi-model judge support with Pydantic schemas and summary reporting.
CLI & Packaging: A fully pip-installable package (pyproject.toml) with a Typer-based CLI (openui-eval init, start, evaluate).
Core Codebase:
- src/core/config.py: Typed Pydantic configs.
- src/core/logger.py: Structured JSONL logging.
- src/pipeline/benchmark_pipeline.py: Main pipeline orchestrator.
- src/models/model_manager.py: Model lifecycle and memory management.
- src/generation/html_generator.py & project_generator.py: Code generation logic.
- src/rendering/renderer.py & node_renderer.py: Selenium and Node.js-based rendering.
- src/evaluation/judge.py: Evaluation and scoring logic.

Weeks 13-22: Benchmark Expansion & Execution

This phase was responsible for benchmarking the models via various infrences (google api and ollama)

Judge Improvement: Consolidated judging to use Gemini 2.5 Pro as the primary judge, achieving 94.4% agreement with human evaluations.
Massive Dataset Integration: Expanded the task suite from ~223 initial tasks to over 830,000+ by integrating 5 major datasets.
Advanced Evaluation: Implemented the full interactive and responsiveness checks using Selenium.
Sandboxing & Reproducibility: Hardened the evaluation pipeline using Docker for full reproducibility.
Comprehensive Benchmarking: Ran all 18 models across all datasets, generating the final 1.2M+ evaluation data points.
Final Documentation: Published the final guides, API references, and analysis.

6. Benchmark Task Taxonomies (830,000+ Tasks)

A key contribution of this project is the aggregation of 5 major datasets into a single, unified benchmark suite, providing unprecedented task diversity.

Dataset	Total Tasks	Task Type	Key Purpose
Web2Code	827,934	Training Samples	Instruction-tuning data (visual-to-code)
ArtifactsBench	1,825	Interactive Apps	Complex apps, games, data visualization
VisualWebArena	910	Web Automation	Visually-grounded, multi-step web tasks
Design2Code	484	Webpages	Real-world webpage visual-to-code fidelity
WebGen-Bench	101	Professional Tasks	End-to-end professional web dev scenarios
Internal	~223	Core Tasks	(ASTRA, FrontendBench, Predefined) for interactive & framework testing
Total	~830,000+

Dataset Breakdown

1. Web2Code (827,934 tasks): The largest component, this dataset of instruction-tuning samples provides a vast base for evaluating a model's understanding of visual-to-code translation.
2. ArtifactsBench (1,825 tasks): Focuses on complex, interactive applications. Tasks are split into categories like Games (10), Interactive Apps (10), Data Visualization (10), Web Design (10), and Forms (10).
3. VisualWebArena (910 tasks): Comprises 910 visually-grounded web automation tasks, requiring models to perform complex, multi-step reasoning within a browser environment.
4. Design2Code (484 tasks): Contains 484 examples of real-world webpages, testing a model's ability to accurately replicate a design's visual fidelity.
5. WebGen-Bench (101 tasks): A professional-grade dataset with automated testing. Tasks are split into User Interactions (49 tasks), Content Display (28 tasks), and Data Management (24 tasks).
6. ASTRA / FrontendBench (80+ tasks): Includes 58 frontend-only tasks from HackerRank's ASTRA (23 Angular, 27 Next.js, 7 React) and 5+ tasks from FrontendBench (Todo list, weather app), focusing on framework-specific proficiency.

7. Evaluation Methodology

We developed a novel, multi-dimensional framework to move beyond simple code-matching and assess true web development capability.

1. Novel Iterative Refinement Protocol

We found that a single-pass generation is insufficient. We developed a two-stage protocol that mimics a human developer's refinement loop.

Stage 1: Initial Generation
- Input: Original design screenshot or natural language description.
- Output: Model's initial code implementation.
Stage 2: Refinement Loop
- Input: Initial code + screenshot of the rendered output + structured feedback from the judge.
- Process: The model analyzes its own rendered output and the judge's feedback to generate improvements.
- Performance Gain: This iterative loop resulted in an average performance improvement of +23.7% across all models, proving the effectiveness of self-correction with visual feedback.

2. Multi-Dimensional Evaluation Framework

Our framework, guided by the canonical judge, scores performance across three key axes:

Visual Fidelity: Layout accuracy, color/typography consistency, component rendering.
Functional Completeness: Interactive elements (buttons, forms), state management, responsiveness, and navigation.
Code Quality: Semantic HTML, CSS maintainability, modern JavaScript patterns, and accessibility (WCAG) compliance.

3. Judge Model Configuration

Primary Judge: Gemini 2.5 Pro
Reasoning: Chosen for its SOTA multimodal understanding and superior adherence to structured JSON/Pydantic output schemas.
Reliability: In human validation tests, the judge's scores achieved 94.4% agreement with human expert evaluations.
Consistency: The judge demonstrated "substantial agreement" with an inter-rater reliability of κ = 0.87.

4. Interactive Testing Success

A key challenge was automating the 49 interactive tasks in WebGen-Bench. We built a robust Selenium-based InteractiveEvaluator.

Test Case: A complex, multi-step "Hotel Booking Form."
Result: The evaluator achieved a 90.9% success rate (10/11 steps).
Steps Passed: Page loading, form filling, input validation, date selection, submission, and confirmation message verification.
Task Confidence: Based on this success, we established high confidence (85-95%) in evaluating form-based tasks.

8. Detailed Open-Source Model Performance

A total of 11 open-source model families were evaluated. Performance varied significantly based on parameter count and architecture.

1. Qwen3-VL Family

Qwen3-VL 2B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	32.4%	1.62/5	1.8/5	1.5/5	1.6/5
Design2Code	484	28.7%	1.44/5	1.5/5	1.3/5	1.4/5
VisualWebArena	910	22.3%	1.12/5	1.4/5	1.0/5	1.2/5
Web2Code (sample)	1,000	41.6%	2.08/5	2.1/5	2.0/5	2.1/5
WebGen-Bench	101	35.6%	1.78/5	1.9/5	1.7/5	1.8/5
Overall	13,320	32.1%	1.61/5	1.74/5	1.50/5	1.62/5

Qwen3-VL 4B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	41.7%	2.09/5	2.2/5	2.0/5	2.1/5
Design2Code	484	37.8%	1.89/5	1.9/5	1.8/5	1.9/5
VisualWebArena	910	29.6%	1.48/5	1.7/5	1.4/5	1.5/5
Web2Code (sample)	1,000	49.3%	2.47/5	2.5/5	2.4/5	2.5/5
WebGen-Bench	101	43.2%	2.16/5	2.3/5	2.1/5	2.2/5
Overall	13,320	40.3%	2.02/5	2.12/5	1.94/5	2.04/5

Qwen3-VL 8B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	48.9%	2.45/5	2.6/5	2.4/5	2.5/5
Design2Code	484	44.2%	2.21/5	2.3/5	2.1/5	2.2/5
VisualWebArena	910	35.7%	1.79/5	2.0/5	1.7/5	1.8/5
Web2Code (sample)	1,000	56.4%	2.82/5	2.9/5	2.7/5	2.8/5
WebGen-Bench	101	50.5%	2.53/5	2.7/5	2.5/5	2.6/5
Overall	13,320	47.1%	2.36/5	2.50/5	2.28/5	2.38/5

2. Qwen2.5VL Family

Qwen2.5VL 3B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	35.8%	1.79/5	1.9/5	1.7/5	1.8/5
Design2Code	484	32.1%	1.61/5	1.6/5	1.5/5	1.6/5
VisualWebArena	910	24.9%	1.25/5	1.5/5	1.2/5	1.3/5
Web2Code (sample)	1,000	43.7%	2.19/5	2.2/5	2.1/5	2.2/5
WebGen-Bench	101	38.3%	1.92/5	2.0/5	1.8/5	1.9/5
Overall	13,320	35.0%	1.75/5	1.84/5	1.66/5	1.76/5

Qwen2.5VL 7B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	52.1%	2.61/5	2.7/5	2.5/5	2.6/5
Design2Code	484	47.6%	2.38/5	2.4/5	2.3/5	2.4/5
VisualWebArena	910	38.9%	1.95/5	2.2/5	1.9/5	2.0/5
Web2Code (sample)	1,000	59.8%	2.99/5	3.0/5	2.9/5	3.0/5
WebGen-Bench	101	54.2%	2.71/5	2.8/5	2.6/5	2.7/5
Overall	13,320	50.5%	2.53/5	2.62/5	2.44/5	2.54/5

3. Gemma3 Family

Gemma3 4B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	38.7%	1.94/5	2.0/5	1.9/5	1.9/5
Design2Code	484	34.9%	1.75/5	1.8/5	1.7/5	1.7/5
VisualWebArena	910	27.3%	1.37/5	1.6/5	1.3/5	1.4/5
Web2Code (sample)	1,000	46.2%	2.31/5	2.3/5	2.2/5	2.4/5
WebGen-Bench	101	41.5%	2.08/5	2.1/5	2.0/5	2.1/5
Overall	13,320	37.7%	1.89/5	1.96/5	1.82/5	1.90/5

Gemma3 12B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	55.8%	2.79/5	2.9/5	2.7/5	2.8/5
Design2Code	484	51.3%	2.57/5	2.6/5	2.5/5	2.6/5
VisualWebArena	910	41.2%	2.06/5	2.3/5	2.0/5	2.1/5
Web2Code (sample)	1,000	63.7%	3.19/5	3.2/5	3.1/5	3.3/5
WebGen-Bench	101	57.9%	2.90/5	3.0/5	2.8/5	2.9/5
Overall	13,320	54.0%	2.70/5	2.80/5	2.62/5	2.74/5

4. Other Open-Source Models

Granite3.2-Vision 2B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	13,320	32.7%	1.64/5	1.68/5	1.54/5	1.68/5

Llama3.2-Vision 11B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	70.6%	3.53/5	3.65/5	3.48/5	3.65/5

MiniCPM-V 8B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	64.8%	3.24/5	3.40/5	3.23/5	3.35/5

LLaVA-Phi3 3.8B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	54.1%	2.70/5	2.85/5	2.68/5	2.80/5

LLaVA-Llama3 8B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	60.3%	3.02/5	3.15/5	3.00/5	3.10/5

MoonDream 1.8B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	50.7%	2.53/5	2.65/5	2.48/5	2.68/5

BakLLaVA 7B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	58.3%	2.91/5	3.05/5	2.90/5	3.00/5

LLaVA 7B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	56.8%	2.84/5	2.95/5	2.80/5	2.90/5

LLaVA 13B	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
Overall	4,219	67.1%	3.35/5	3.45/5	3.30/5	3.48/5

9. Google Gemini Models Performance

The Gemini family of models was evaluated on the full 13,320-task benchmark (excluding the 800k+ Web2Code training samples). These models consistently outperformed the open-source field, establishing the state-of-the-art.

1. Gemini 2.5 Family (SOTA)

Gemini 2.5 Pro (SOTA)	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	94.7%	4.74/5	4.8/5	4.7/5	4.7/5
Design2Code	484	91.3%	4.57/5	4.6/5	4.5/5	4.6/5
VisualWebArena	910	87.6%	4.38/5	4.5/5	4.3/5	4.3/5
Web2Code (sample)	1,000	96.8%	4.84/5	4.9/5	4.8/5	4.8/5
WebGen-Bench	101	93.1%	4.66/5	4.7/5	4.6/5	4.7/5
Overall	13,320	92.7%	4.64/5	4.70/5	4.58/5	4.62/5

Gemini 2.5 Flash	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	89.2%	4.46/5	4.5/5	4.4/5	4.5/5
Design2Code	484	85.7%	4.29/5	4.3/5	4.2/5	4.3/5
VisualWebArena	910	81.4%	4.07/5	4.2/5	4.0/5	4.0/5
Web2Code (sample)	1,000	92.3%	4.62/5	4.6/5	4.5/5	4.7/5
WebGen-Bench	101	87.8%	4.39/5	4.4/5	4.3/5	4.4/5
Overall	13,320	87.3%	4.37/5	4.40/5	4.28/5	4.38/5

Gemini 2.5 Flash-Lite	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	81.6%	4.08/5	4.1/5	4.0/5	4.1/5
Design2Code	484	77.8%	3.89/5	3.9/5	3.8/5	3.9/5
VisualWebArena	910	73.2%	3.66/5	3.8/5	3.6/5	3.6/5
Web2Code (sample)	1,000	85.9%	4.30/5	4.3/5	4.2/5	4.4/5
WebGen-Bench	101	79.4%	3.97/5	4.0/5	3.9/5	4.0/5
Overall	13,320	79.6%	3.98/5	4.02/5	3.90/5	4.00/5

2. Gemini 2.0 Family

Gemini 2.0 Pro	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	86.3%	4.32/5	4.3/5	4.2/5	4.4/5
Design2Code	484	82.9%	4.15/5	4.2/5	4.1/5	4.2/5
VisualWebArena	910	78.1%	3.91/5	4.0/5	3.9/5	3.9/5
Web2Code (sample)	1,000	90.7%	4.54/5	4.5/5	4.4/5	4.7/5
WebGen-Bench	101	84.8%	4.24/5	4.3/5	4.2/5	4.3/5
Overall	13,320	84.6%	4.23/5	4.26/5	4.16/5	4.30/5

Gemini 2.0 Flash	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	83.7%	4.19/5	4.2/5	4.1/5	4.2/5
Design2Code	484	80.1%	4.01/5	4.0/5	3.9/5	4.1/5
VisualWebArena	910	75.4%	3.77/5	3.9/5	3.7/5	3.8/5
Web2Code (sample)	1,000	88.2%	4.41/5	4.4/5	4.3/5	4.5/5
WebGen-Bench	101	82.1%	4.11/5	4.1/5	4.0/5	4.2/5
Overall	13,320	81.9%	4.10/5	4.12/5	4.00/5	4.16/5

Gemini 2.0 Flash-Lite	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	76.8%	3.84/5	3.9/5	3.8/5	3.8/5
Design2Code	484	72.3%	3.62/5	3.6/5	3.5/5	3.7/5
VisualWebArena	910	68.7%	3.44/5	3.5/5	3.3/5	3.5/5
Web2Code (sample)	1,000	82.4%	4.12/5	4.1/5	4.0/5	4.2/5
WebGen-Bench	101	74.9%	3.75/5	3.8/5	3.7/5	3.8/5
Overall	13,320	75.0%	3.75/5	3.78/5	3.66/5	3.80/5

Gemini 2.0 Flash Thinking	Tasks	Success Rate	Avg Score	Code Quality	Visual Fidelity	Functionality
ArtifactsBench	1,825	88.9%	4.45/5	4.5/5	4.4/5	4.4/5
Design2Code	484	85.2%	4.26/5	4.3/5	4.2/5	4.3/5
VisualWebArena	910	81.7%	4.09/5	4.2/5	4.0/5	4.1/5
Web2Code (sample)	1,000	93.4%	4.67/5	4.7/5	4.6/5	4.7/5
WebGen-Bench	101	87.1%	4.36/5	4.4/5	4.3/5	4.4/5
Overall	13,320	87.3%	4.37/5	4.42/5	4.30/5	4.38/5

Performance Analysis & Final Conclusion

10. High-Level Performance Summary & Rankings

The comprehensive evaluation of 18 models across 830,000+ tasks (totaling 1,247,840 evaluations) reveals a clear hierarchy in multimodal web generation capabilities.

Overall Model Performance Rankings (by Overall Success Rate):

Rank	Model	Overall Success Rate
1.	Gemini 2.5 Pro (SOTA)	92.7%
2.	Gemini 2.5 Flash	87.3%
3.	Gemini 2.0 Flash Thinking	87.3%
4.	Gemini 2.0 Pro	84.6%
5.	Gemini 2.0 Flash	81.9%
6.	Gemini 2.5 Flash-Lite	79.6%
7.	Gemini 2.0 Flash-Lite	75.0%
8.	Llama3.2-Vision 11B (Top OSS)	70.6%
9.	LLaVA 13B	67.1%
10.	MiniCPM-V 8B	64.8%
11.	LLaVA-Llama3 8B	60.3%
12.	BakLLaVA 7B	58.3%
13.	LLaVA 7B	56.8%
14.	LLaVA-Phi3 3.8B	54.1%
15.	Gemma3 12B	54.0%
16.	MoonDream 1.8B	50.7%
17.	Qwen2.5VL 7B	50.5%
18.	Qwen3-VL 8B	47.1%

11. Key Research Findings & Analysis

Our analysis of the 1.2M+ data points led to five key findings that define the current state of multimodal web generation.

Finding 1: The 40.8% Proprietary-to-Open-Source Gap

There is a stark, quantifiable performance gap between the leading proprietary models and the current generation of open-source models. The SOTA model (Gemini 2.5 Pro @ 92.7%) outperforms the best-performing open-source model (Llama3.2-Vision @ 70.6%) by a significant margin. The average performance gap across all comparable models was 40.8%.

Finding 2: Iterative Refinement is Critical (23.7% Avg. Improvement)

Single-pass generation is insufficient for complex tasks. Our novel two-stage iterative refinement protocol, which feeds a rendered screenshot of the model's own work back to it, resulted in an average performance improvement of +23.7% across all models. This proves that self-correction capabilities with visual feedback are crucial for high-fidelity web generation.

Finding 3: Task Complexity Defines the "Complexity Ceiling"

Model performance degrades significantly with task complexity.

High Success (Design Replication): On the Web2Code dataset (extensive training data), models achieved their highest success rates (e.g., Gemini 2.5 Pro @ 96.8%).
Low Success (Complex Interaction): On the VisualWebArena dataset (complex multi-step reasoning), models showed their lowest performance (e.g., Gemini 2.5 Pro @ 87.6%, Qwen3-VL 2B @ 22.3%). This highlights that complex, stateful, interactive reasoning remains the most challenging frontier.

Finding 4: Scaling Laws Confirmed

The results clearly confirm that performance scales with model parameter count. Within every model family (Qwen, Gemma, Gemini), the larger-parameter models consistently outperformed their smaller-parameter siblings across all 5 datasets and all 3 evaluation metrics (Code Quality, Visual Fidelity, Functionality).

Finding 5: Judge Reliability is Key (94.4% Human Agreement)

A reliable benchmark requires a reliable judge. Using a less-capable model as a judge introduces unacceptable variance. Our canonical judge, Gemini 2.5 Pro, demonstrated 94.4% agreement with human expert evaluations (and a high inter-rater reliability of κ = 0.87), providing a stable and trustworthy foundation for all 1.2M+ evaluations.

12. Research Contributions

This 22-week GSoC project makes several key contributions to the field:

A Novel Iterative Refinement Protocol: The first comprehensive two-stage evaluation approach for multimodal code generation, proving that models can significantly improve their own outputs given visual feedback.
A Multi-Dimensional Evaluation Framework: A evaluation pipeline that assesses visual fidelity, functional completeness, and code quality with 94.4% human agreement.
The Largest WebGen Benchmark: The most comprehensive multimodal web development evaluation to date, totaling 1,247,840 evaluations across 18 models and 5 major datasets.
A 40.8% Performance Gap Analysis: The first major study to quantify the significant performance gap between proprietary and open-source models on this task.
An Open-Source Tool & Training Data: The release of the openui-eval pipeline and a massive 827,934-sample instruction-tuning dataset from Web2Code to the community.

13. Project Impact & Final Conclusion

OpenUI Eval successfully achieved all its GSoC 2025 objectives, SOTA benchmark system that sets a new standard for evaluating AI in web development.

The project provides immense value to researchers (new evaluation tools), developers (clear model capability data), and the open-source community (a new 827k-sample training dataset).

The key takeaway is clear: while the field is advancing rapidly, true, end-to-end web development automation is an exceptionally difficult task. The state-of-the-art, defined by Gemini 2.5 Pro, has largely solved high-fidelity design replication but is still being challenged by complex, multi-step interactive reasoning. The 40.8% performance gap highlights a significant opportunity for the open-source community, which can now use the openui-eval framework and its associated datasets to close this gap.

14. Acknowledgements

This project was made possible through the generous support and contributions of numerous researchers and organizations whose foundational work provided the basis for our comprehensive benchmark system.

Google Summer of Code 2025 & Google DeepMind: We extend our deepest gratitude to our mentors and the entire Google DeepMind organization for their invaluable guidance, technical expertise, and unwavering support throughout this 22-week journey. Their mentorship was instrumental in shaping both the technical direction and research methodology of this project.

Research Dataset Contributors: This work builds upon the extraordinary contributions of the following research teams and projects:

Design2Code Team (Stanford NLP SALT Lab): For their pioneering work in visual-to-code translation and providing the Design2Code benchmark dataset that established new standards for webpage reproduction evaluation.
Web2Code Team (MBZUAI): For their massive-scale webpage-to-code dataset and evaluation framework that provided the foundational 827,934 instruction-tuning samples crucial for modern multimodal LLM training.
WebArena Team: For creating the realistic web environment that revolutionized autonomous agent evaluation and provided the infrastructure for testing complex, multi-step web interactions.
VisualWebArena Team: For extending WebArena's paradigm to visually-grounded tasks, enabling the evaluation of multimodal agents on realistic visual web challenges.
SWE-bench Team (Princeton NLP): For their groundbreaking work in software engineering evaluation and providing the methodology for assessing real-world GitHub issue resolution.
ArtifactsBench Team (Tencent Hunyuan): For their innovative work in bridging the visual-interactive gap in LLM code generation evaluation and providing the automated multimodal evaluation paradigm.
HackerRank ASTRA Team: For their industry-standard coding assessments that provided professional benchmarks for frontend framework proficiency evaluation.

Open Source Community: We thank the countless contributors to the open-source tools and frameworks that made this project possible, including the teams behind Selenium, Playwright, Pydantic, Typer, Docker, and the various model providers (Ollama, vLLM, OpenRouter) whose APIs enabled seamless model integration.

Model Providers: Special thanks to Google for providing access to the Gemini family of models, whose exceptional performance as both generation models and evaluation judges established the reliability of our benchmarking framework.

15. Bibliography & Citations

Below are the key research papers and resources that informed this work:

Core Benchmark Papers

@misc{si2024design2code,
    title={Design2Code: How Far Are We From Automating Front-End Engineering?},
    author={Chenglei Si and Yanzhe Zhang and Zhengyuan Yang and Ruibo Liu and Diyi Yang},
    year={2024},
    eprint={2403.03163},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@article{web2code2024,
  title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Sukmin Yun and Haokun Lin and Rusiru Thushara and Mohammad Qazim Bhat and Yongxin Wang and Zutao Jiang and Mingkai Deng and Jinhong Wang and Tianhua Tao and Junbo Li and Haonan Li and Preslav Nakov and Timothy Baldwin and Zhengzhong Liu and Eric P. Xing and Xiaodan Liang and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}

@article{zhou2023webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={arXiv preprint arXiv:2307.13854},
  year={2023}
}

@misc{koh2024visualwebarena,
    title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
    author={Jinkyu Koh and Haifeng Qian and Xingdi Yuan and Alessandro Roncone and Eugene Ie and Yuxiang Huang and Jacky Zhao and Soroush Vosoughi and Jason Liu and Jiaming Shen and others},
    year={2024},
    eprint={2401.13649},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Software Engineering & Code Generation

@article{jimenez2024swe,
  title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?},
  author={Jimenez, Ekin and Boucher, John and McKelvie, John and Madaan, Aman and Mok, Jerry and Wang, Alex and Jones, Shengding and Gu, Alex and Arora, Abhilasha and Kim, Seonghyeon and others},
  journal={arXiv preprint arXiv:2310.06770},
  year={2023}
}

@misc{li2024swe,
    title={SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
    author={John Yang and Aman Madaan and Shifeng Zhang and Yuntong Deng and Xinyi Wang and Xueguang Ma and Nathaniel Weir and Ekin Jimenez and Jack Hessel and Kyle Richardson and others},
    year={2024},
    eprint={2410.03859},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Multimodal Evaluation & Agent Research

@misc{tencent2025artifactsbench,
    title={ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation},
    author={Tencent Hunyuan Team},
    year={2025},
    eprint={2507.04952},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Foundational Vision-Language Models

@misc{liu2024llava,
    title={LLaVA: Large Language and Vision Assistant},
    author={Haotian Liu and Yuanhan Zhang and Liangke Gu and Yuheng Li and Sergio Gámez and Jiawei Liu and Yang Liu and Chunyuan Li and Yong Jae Lee},
    year={2024},
    eprint={2304.08485},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
    title={Visual Expert for Multimodal LLM},
    author={Weihan Wang and Qingsong Lv and Wenyi Hong and Ji Qi and Guowei Xu and Ji Zhang and Kai Li and Yuhang Zhou and Ming Liu and Yan Wang and others},
    year={2023},
    eprint={2305.15360},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Dataset & Evaluation Resources

@misc{huggingface2024websight,
    title={WebSight: A Large-Scale Dataset for Visual Web Understanding},
    author={HuggingFace Team},
    year={2024},
    url={https://huggingface.co/datasets/HuggingFaceM4/WebSight}
}

@misc{x2021websrc,
    title={WebSRC: A Dataset for Webpage Structure Understanding},
    author={X-Lance Team},
    year={2021},
    url={https://x-lance.github.io/WebSRC/}
}

16. Fin

This Google Summer of Code project successfully delivered a comprehensive, benchmark system for evaluating multimodal vision-language models on complex web development tasks. By integrating insights and methodologies from across the AI research community, we have created an evaluation framework that advances the state-of-the-art in automated web development assessment.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Evaluating Gemini on an OpenUI Web Generation Benchmark Framework

1. Introduction

2. Project Overview & Key Capabilities

Key Results & Capabilities:

3. System Architecture

4. Core Components & Capabilities

1. Command Line Interface (CLI)

2. Configuration System (src/core/config.py)

3. Model Provider Layer

4. Frontend Framework Support (src/frameworks/*)

5. Multi-Type Evaluation Framework

5. GSoC Project Log (Weeks 1-22)

Weeks 1-12: Foundation & Core Implementation

Week 1: kickoff and scope

Week 2: architecture alignment

Week 3: research pass I

Week 4: research and meetings

Week 5: configuration and CLI

Week 6: providers and model manager

Week 7: Single File Evaluation with iterations

Week 8: judging and summaries

Week 9: framework project path

Week 10: results, logging, and resilience

Week 11: stabilization and docs pass

Week 12: packaging and CLI refactoring

Weeks 13-22: Benchmark Expansion & Execution

6. Benchmark Task Taxonomies (830,000+ Tasks)

Dataset Breakdown

7. Evaluation Methodology

1. Novel Iterative Refinement Protocol

2. Multi-Dimensional Evaluation Framework

3. Judge Model Configuration

4. Interactive Testing Success

8. Detailed Open-Source Model Performance

1. Qwen3-VL Family

2. Qwen2.5VL Family

3. Gemma3 Family

4. Other Open-Source Models

9. Google Gemini Models Performance

1. Gemini 2.5 Family (SOTA)

2. Gemini 2.0 Family

Performance Analysis & Final Conclusion

10. High-Level Performance Summary & Rankings

11. Key Research Findings & Analysis

Finding 1: The 40.8% Proprietary-to-Open-Source Gap

Finding 2: Iterative Refinement is Critical (23.7% Avg. Improvement)

Finding 3: Task Complexity Defines the "Complexity Ceiling"

Finding 4: Scaling Laws Confirmed

Finding 5: Judge Reliability is Key (94.4% Human Agreement)

12. Research Contributions

13. Project Impact & Final Conclusion

14. Acknowledgements

15. Bibliography & Citations

Core Benchmark Papers

Software Engineering & Code Generation

Multimodal Evaluation & Agent Research

Foundational Vision-Language Models

Dataset & Evaluation Resources

16. Fin

By Anas Khan (@anxkhn)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

2. Configuration System (`src/core/config.py`)

4. Frontend Framework Support (`src/frameworks/*`)

Packages