A systematically curated and taxonomically organized knowledge base that unifies seven converging research frontiers of multimodal AI — encompassing 2000+ papers, 46 arena-ranked models, 100+ benchmarks, and 90+ datasets — providing researchers with a single, authoritative cross-disciplinary reference.
Explore the Compendium · Arena Leaderboard · Technical Blogs · Star on GitHub
If you find this resource valuable for your research, please consider giving us a ⭐ to increase its visibility.
- Overview
- Seven Research Pillars
- Arena Leaderboard Spotlight
- Benchmarks & Evaluation Metrics
- Datasets
- Documentation Structure
- Source Repositories
- Key Features
- Quick Start
- Contributing
- Citation
The field of multimodal artificial intelligence has undergone an unprecedented convergence of historically disparate research traditions. Text-to-image synthesis and text-to-video generation have evolved from nascent subfields into foundational pillars of generative AI, while image-to-video generation bridges static visual content with temporal dynamics through learned motion priors. 3D vision — underpinned by Neural Radiance Fields, 3D Gaussian Splatting, LLM-driven scene understanding, and visual SLAM — extends generative modeling into the spatial domain. 4D spatial intelligence introduces the temporal axis through depth estimation, dense 3D/4D tracking, and physics-grounded simulation. These areas intersect profoundly with world models that seek to learn predictive environment dynamics, and unified multimodal architectures that dissolve the boundary between visual perception and generation.
OpenMM-Arena serves as a definitive, centralized knowledge base that systematically organizes, cross-references, and catalogues this expansive and rapidly growing literature across seven research pillars:
┌──────────────────────────────────────┐
│ OpenMM-Arena │
│ Multimodal AI Research Compendium │
└──────────────┬───────────────────────┘
│
┌──────────┬──────────┬────────┼────────┬──────────┬──────────┐
▼ ▼ ▼ ▼ ▼ ▼ ▼
┌─────────┐┌─────────┐┌────────┐┌──────┐┌──────┐┌──────────┐┌────────┐
│ T2I ││ T2V ││ I2V ││ 3D ││ 4D ││ Unified ││ World │
│ 250+ ││ 200+ ││ 170+ ││ 500+ ││ 500+ ││ 120+ ││ 200+ │
│ papers ││ papers ││ papers ││papers││papers││ models ││ papers │
└────┬────┘└────┬────┘└───┬────┘└──┬───┘└──┬───┘└────┬─────┘└───┬────┘
│ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼
Models Foundation Animation 3DGS Depth Diffusion Theory
Control Controllable Editing NeRF Tracking AR Games
Editing Benchmarks Portraits SLAM Recon Hybrid Driving
Safety Datasets Transfer LLM-3D Dynamic Any2Any Embodied
Arena Robot Human Eval Sim
Benchmarks Nav Physics Datasets MBRL
2000+ Papers · 7 Research Pillars · 46 Arena-Ranked Models · 100+ Benchmarks · 90+ Datasets · 73 Curated Blog Posts · 11 Source Repositories
Text-to-Image Generation · 250+ papers · 46 arena-ranked models · 6 sub-domains
The progressive evolution of text-conditioned image synthesis — from generative adversarial networks through denoising diffusion models to autoregressive transformers. Encompasses foundational architectures (FLUX.2, Seedream 3.0, GPT-Image, Imagen 3), controllable generation via spatial and semantic conditioning (ControlNet, GLIGEN), text-guided editing and subject-driven personalization (InstructPix2Pix, DreamBooth), safety and bias mitigation, and comprehensive arena leaderboards derived from 3.9M+ human preference votes.
Sub-domains: Models & Face Synthesis · Control & Composition · Editing & Personalization · Safety & Applications · Cross-Modal Extensions · Arena & Benchmarks
Text-to-Video Generation · 200+ papers · 25+ datasets · 3 sub-domains
Foundation video synthesis models spanning GANs to diffusion transformers — covering Sora, Wan 2.1, Veo 2, CogVideoX, and contemporary controllable and efficient video generation approaches. Addresses long-form video synthesis, temporal coherence benchmarks, and large-scale video-text corpora.
Sub-domains: Foundation T2V Models · Controllable & Efficient Synthesis · Benchmarks & Datasets
Image-to-Video Generation · 170+ papers · 20+ applications · 2 sub-domains
The synthesis of dynamic video sequences from static images via learned motion priors — encompassing image animation, character-driven video synthesis, talking-head generation, temporally consistent video editing, motion transfer, and audio-driven synthesis. (Documented within the T2V reference.)
Sub-domains: Animation & Portraits · Video Editing & Enhancement
3D Vision · 500+ papers · 6 sub-domains
A comprehensive survey spanning 3D Gaussian Splatting, Neural Radiance Fields (NeRF), text/image-to-3D generation, LLM-driven 3D understanding, NeRF-SLAM, GS-SLAM, visual and LiDAR SLAM, robotic manipulation, autonomous navigation, and spatial localization. Extends generative AI into the spatial domain, forging connections to real-time robotics and embodied agents.
Sub-domains: 3D Gaussian Splatting · NeRF & Generation · NeRF-SLAM · GS-SLAM · LLM-3D Understanding · Robotics & Navigation
4D Spatial Intelligence · 500+ papers · 6 sub-domains
The temporal dimension of 3D understanding — monocular and multi-view depth estimation, camera pose recovery, dense 3D/4D point tracking, scene reconstruction, 4D dynamic scenes via deformable NeRFs and 4D Gaussian Splatting, human-centric motion capture, and physics-grounded simulation.
Sub-domains: Geometry & Depth · 3D/4D Tracking · Reconstruction · Dynamic Scenes · Human-Centric · Physics-Based Simulation
Unified Multimodal Models · 120+ models · 30+ benchmarks · 3 sub-domains
Architectures that jointly perform visual understanding and generation within a single parametric framework — diffusion-based unified models, autoregressive multimodal LLMs with pixel and semantic encoding, hybrid AR-diffusion architectures, and any-to-any paradigms. Complemented by comprehensive evaluation benchmarks and curated training corpora.
Sub-domains: Models & Architectures · Datasets & Training Corpora · Evaluation & Benchmarks
World Models · 200+ papers · 6 theory themes · 6 sub-domains
Learned environment dynamics for game simulation, autonomous driving, embodied manipulation, model-based reinforcement learning, video generation as world simulation, and theoretical underpinnings. Traces the lineage from Ha & Schmidhuber's seminal formulation through contemporary systems such as Sora, NVIDIA Cosmos, and Genie.
Sub-domains: Theory & Surveys · Game Simulation · Video Generation · LiDAR Generation · Occupancy Generation · Embodied AI
Source: LM Arena · 3.9M+ human preference votes · 46 models · Updated February 2026
| Rank | Model | Organization | Elo Score | License |
|---|---|---|---|---|
| 1 | GPT-Image-1.5 | OpenAI | 1248 | Proprietary |
| 2 | Gemini-3-Pro Image 2K | Google | 1237 | Proprietary |
| 3 | Seedream 3.0 | ByteDance | 1233 | Proprietary |
| 4 | Grok Imagine Image | xAI | 1174 | Proprietary |
| 5 | FLUX.2 Max | Black Forest Labs | 1169 | Proprietary |
| 6 | Grok Imagine Image Pro | xAI | 1166 | Proprietary |
| 7 | FLUX.2 Flex | Black Forest Labs | 1158 | Proprietary |
| 8 | Gemini 2.5 Flash Image | Google | 1157 | Proprietary |
| 9 | FLUX.2 Pro | Black Forest Labs | 1156 | Proprietary |
| 10 | HunyuanImage 3.0 | Tencent | 1151 | Community |
View Complete Leaderboard (46 Models) →
| Finding | Analysis |
|---|---|
| Proprietary-model dominance | The top five positions are held by OpenAI, Google, ByteDance, xAI, and Black Forest Labs, all with proprietary licenses, indicating that proprietary training infrastructure and data remain decisive advantages |
| Rapid intra-family progress | GPT-Image-1.5 (Elo 1248) surpasses GPT-Image-1 (Elo 1115) by 133 points — a substantial within-family improvement |
| Narrowing open-weight gap | Qwen-Image (Apache 2.0, Elo 1139, rank 15) and FLUX.2 Klein 4B (Apache 2.0, Elo 1021) demonstrate that open-weight models are increasingly competitive |
| Growing Chinese-lab presence | Models from Alibaba, ByteDance, and Tencent now occupy multiple top-20 positions, reflecting significant investment in multimodal generation research |
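The Elo rankings above are derived from pairwise human preference votes. As a rough illustration of the underlying mechanism, the sketch below applies a classic online Elo update to a stream of votes; the K-factor, initial rating, and one-vote-at-a-time update rule are illustrative assumptions rather than LM Arena's exact methodology, which fits ratings over all votes jointly.

```python
# Minimal sketch of an online Elo update from pairwise preference votes.
# Illustrative only: the K-factor and initial rating are assumptions, and
# production leaderboards typically fit ratings over all votes at once
# (e.g., Bradley-Terry maximum likelihood) rather than updating per vote.

K_FACTOR = 32
INITIAL_RATING = 1000.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one preference vote in which `winner` was preferred over `loser`."""
    ra = ratings.setdefault(winner, INITIAL_RATING)
    rb = ratings.setdefault(loser, INITIAL_RATING)
    ea = expected_score(ra, rb)                 # expected win probability of the winner
    ratings[winner] = ra + K_FACTOR * (1.0 - ea)
    ratings[loser] = rb - K_FACTOR * (1.0 - ea)


# Example: three hypothetical votes between two models.
ratings: dict = {}
for w, l in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
    update(ratings, w, l)
print(ratings)
```

The online form is shown only for compactness; batch fitting over the full vote matrix is the more common choice for public leaderboards because it is order-independent.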
| Category | Representative Benchmarks |
|---|---|
| Multimodal Understanding | General-Bench, MMMU, MM-Vet v2, MMBench, SEED-Bench-2, GQA |
| Image Generation Quality | GenExam, KRIS-Bench, WISE, DreamBench++, T2I-CompBench++, GenAI-Bench, TIFA, HEIM |
| World Model Evaluation | WorldScore, WorldSimBench, PhyWorld, Newton, WorldGym, EWMBench |
| Interleaved Generation | UniBench, OpenING, ISG, MMIE |
| Human-Preference Rankings | LM Arena T2I (3.9M votes), Artificial Analysis T2I (15 style categories) |
| Metric | Full Name | Primary Domain |
|---|---|---|
| FID | Fréchet Inception Distance | Image Generation Quality |
| CLIP Score | CLIP-based Text–Image Alignment Score | Text-to-Image Faithfulness |
| VQAScore | VQA-based Compositional Score | T2I Semantic Faithfulness |
| HPSv2 | Human Preference Score v2 | Human Preference Alignment |
| ImageReward | Image Reward Model | Human Preference Alignment |
| LPIPS | Learned Perceptual Image Patch Similarity | Perceptual Image Quality |
| TIFA | Text-to-Image Faithfulness Assessment | Attribute Faithfulness |
| DSG | Davidsonian Scene Graph | Compositional Correctness |
| SSIM | Structural Similarity Index Measure | Structural Image Quality |
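Several of the metrics above are straightforward to reproduce. As an example, CLIP Score is commonly computed as the scaled cosine similarity between CLIP embeddings of a generated image and its prompt; the minimal sketch below uses the Hugging Face transformers CLIP implementation, with the checkpoint name and the conventional ×100 scaling as illustrative assumptions rather than choices prescribed by this repository.

```python
# Minimal sketch: CLIP-based text-image alignment score for one generated image.
# Assumes `pip install torch transformers pillow`; the checkpoint name and the
# x100 scaling follow common practice and are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return 100.0 * (img_emb * txt_emb).sum().item()


# Usage: score = clip_score(Image.open("sample.png"), "a red cube on a table")
```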
90+ datasets catalogued across all modalities
| Category | Representative Datasets | Approximate Scale |
|---|---|---|
| Multimodal Understanding | LAION-5B, DataComp, Infinity-MM, Cambrian-10M | 5.9B – 10M samples |
| Text-to-Image | LAION-Aesthetics, PixelProse, PD12M, CC-12M, SAM | 120M – 11M samples |
| Image Editing | ByteMorph-6M, UltraEdit, AnyEdit, ImgEdit | 6M – 1.2M samples |
| Interleaved Multimodal | OmniCorpus (8B), OBELICS (141M), Multimodal C4 (101M) | 8B – 101M samples |
| Video-Text | WebVid-10M, InternVid, HD-VILA-100M, Panda-70M | 100M – 10M samples |
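Corpora at this scale are typically consumed by streaming rather than downloading them in full. The minimal sketch below uses the Hugging Face datasets streaming API; the dataset identifier and field names are hypothetical placeholders, since each corpus in the table ships with its own schema and access instructions.

```python
# Minimal sketch: streaming a web-scale image-text corpus without a full download.
# Requires `pip install datasets`. "org/example-image-text-corpus" and the field
# names ("url", "caption") are hypothetical placeholders; consult each dataset's
# card for its actual identifier and schema.
from datasets import load_dataset

stream = load_dataset("org/example-image-text-corpus", split="train", streaming=True)

for i, sample in enumerate(stream):
    print(sample["url"], sample["caption"][:80])
    if i >= 4:  # inspect only the first few records
        break
```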
OpenMM-Arena/
├── README.md ← This document (project overview)
├── docs/
│ ├── T2I.md ← Text-to-Image: in-depth reference
│ ├── T2V.md ← Text-to-Video & Image-to-Video: in-depth reference
│ ├── 3D_VISION.md ← 3D Vision: in-depth reference
│ ├── 4D_VISION.md ← 4D Spatial Intelligence: in-depth reference
│ ├── UNIFIED.md ← Unified Multimodal Models: in-depth reference
│ ├── WORLD_MODELS.md ← World Models: in-depth reference
│ ├── ARENA.md ← Arena Leaderboard: full rankings & analysis
│ │
│ ├── index.html ← Website: landing page
│ ├── blog.html ← Website: curated technical blog posts (73 posts)
│ ├── arena.html ← Website: interactive arena leaderboards
│ ├── style.css ← Website: design system
│ ├── script.js ← Website: interactive features & search
│ ├── sitemap.xml ← Website: SEO sitemap
│ ├── robots.txt ← Website: crawler directives
│ ├── 404.html ← Website: custom error page
│ ├── t2i.html / t2v.html ... ← Website: pillar hub pages
│ ├── t2i/ t2v/ i2v/ ... ← Website: sub-domain detail pages
│ └── blog/ ← Website: all blog posts page
| Document | Content Scope | Coverage |
|---|---|---|
| T2I.md | Foundational architectures, face synthesis, controllable generation, editing, safety, cross-modal extensions, arena rankings | 250+ papers |
| T2V.md | Foundation T2V models, controllable synthesis, I2V animation, video editing, benchmarks, datasets | 370+ papers |
| 3D_VISION.md | 3D Gaussian Splatting, NeRF, SLAM, LLM-3D understanding, robotics, navigation | 500+ papers |
| 4D_VISION.md | Depth estimation, 3D/4D tracking, reconstruction, dynamic scenes, human motion, physics simulation | 500+ papers |
| UNIFIED.md | Diffusion, autoregressive, hybrid, any-to-any unified models, evaluation benchmarks | 120+ models |
| WORLD_MODELS.md | Theory, game simulation, autonomous driving, video generation, embodied AI, MBRL | 200+ papers |
| ARENA.md | LM Arena & Artificial Analysis leaderboards, 46 ranked models, Elo methodology | 46 models |
OpenMM-Arena systematically consolidates and cross-references knowledge from 11 foundational open-source repositories:
| Repository | Primary Domain | Maintainer |
|---|---|---|
| Awesome-Text-to-Image | Text-to-Image Generation | Yutong Zhou et al. |
| Awesome-T2V-Generation | Text-to-Video Generation | soraw-ai |
| Awesome-Video-Diffusion | Video Diffusion Models | ShowLab |
| Awesome-3DGS | 3D Gaussian Splatting | MrNeRF |
| Awesome-3DReconstruction | 3D Reconstruction | OpenMVG |
| Awesome-LLM-3D | LLM-3D Understanding | ActiveVisionLab |
| Awesome-3D-Vision | 3D Vision (General) | Hardy-Uint |
| Awesome-NeRF-3DGS-SLAM | NeRF & 3DGS-based SLAM | 3D-Vision-World |
| Awesome-4D-SI | 4D Spatial Intelligence | Yukang Cao |
| Awesome-World-Models | World Models | Siqiao Huang et al. |
| Awesome-Unified-MM | Unified Multimodal Models | AIDC-AI / Zhang et al. |
| Feature | Description |
|---|---|
| 2000+ Papers Catalogued | Among the most comprehensive multimodal AI paper collections available — meticulously organized with full bibliographic citations, arXiv links, and venue information |
| Taxonomic Organization | Navigate a carefully structured hierarchy — from high-level research pillars through thematic sub-domains to individual papers, with principled categorization by chronological era and methodological paradigm |
| Arena Leaderboards | Real-time rankings derived from 3.9M+ human preference votes — enabling head-to-head model comparison via Elo scores, win rates, and comprehensive performance statistics |
| 73 Curated Blog Posts | An anthology of authoritative technical blog posts from leading AI laboratories (BFL, Google DeepMind, OpenAI, Meta AI, NVIDIA, Stability AI, ByteDance, and others) |
| Smart Search & Filtering | Real-time search, sortable tables, year-based filtering, and keyboard shortcuts for efficient navigation |
| Dark Mode | Full dark-mode support with a carefully designed color scheme preserving readability and contrast |
| Responsive Design | Optimized for desktop, tablet, and mobile viewing with collapsible sidebar navigation |
| Continuously Updated | Maintained in pace with the latest developments — new publications from CVPR, ICLR, NeurIPS, ECCV, and arXiv are integrated regularly |
| Open Source | Released under the MIT License — community-driven, with contributions welcome from the global research community |
Browse online (recommended):
Run locally:
git clone https://github.com/OpenEnvision-Lab/OpenMM-Arena.git
cd OpenMM-Arena/docs
python -m http.server 8000
# Navigate to http://localhost:8000 in your browser
Keyboard shortcuts (on the website):
| Key | Action |
|---|---|
| Cmd/Ctrl + K | Open global search |
| D | Toggle dark mode |
| [ | Toggle sidebar |
| ? | Display all shortcuts |
We welcome contributions from the research community. There are several ways to participate:
- Adding papers — Submit a pull request incorporating new publications into any of the seven research pillars
- Updating leaderboards — Help maintain current arena rankings as new models are evaluated
- Correcting errors — Report or resolve citation errors, broken links, or taxonomic misclassifications
- Proposing improvements — Open an issue with suggestions for better organization or new content areas
- Disseminating the resource — Star the repository, share on academic and social platforms, and cite in your publications
If you find OpenMM-Arena valuable for your research, please consider:
- Starring this repository to increase its visibility within the GitHub community
- Sharing on academic platforms — Twitter/X, Reddit r/MachineLearning, Hacker News, Papers With Code
- Citing in your publications — see the Citation section below
- Cross-referencing — if you maintain a related awesome list or survey, consider linking to OpenMM-Arena
For repository maintainers: please add these topics in the GitHub repository settings to maximize discoverability.
multimodal-ai text-to-image text-to-video image-to-video 3d-vision 4d-vision world-models diffusion-models generative-ai computer-vision deep-learning awesome-list neural-radiance-fields gaussian-splatting research-papers benchmark leaderboard arena survey paper-collection
If you find OpenMM-Arena useful for your research, please consider citing:
@misc{openmmarena2025,
title = {OpenMM-Arena: A Comprehensive Compendium for Multimodal AI Research},
author = {OpenMM-Arena Contributors},
year = {2025},
  howpublished = {GitHub repository},
url = {https://github.com/OpenEnvision-Lab/OpenMM-Arena}
}
Cite foundational works:
@inproceedings{zhou2023vision,
title = {Vision + Language Applications: A Survey},
author = {Zhou, Yutong and Shimada, Nobutaka},
booktitle = {CVPRW},
year = {2023}
}
@misc{huang2025awesomeworldmodels,
title = {Awesome-World-Models},
author = {Siqiao Huang},
year = {2025},
url = {https://github.com/knightnemo/Awesome-World-Models}
}
@article{zhang2025unified,
title = {Unified Multimodal Understanding and Generation Models},
author = {Zhang, Xinjie and others},
journal = {arXiv preprint arXiv:2505.02567},
year = {2025}
}