HackIllinois 2026 · Modal Track · Best Use of Gemini API · Most Creative
Point any phone at a room (no app, no hardware) and a live 3D map builds in real time on Modal's H100 GPU. An AI agent finds objects by name, pins them in 3D, and answers spatial questions using real geometry.
Live Demo · How It Works · Deploy It Yourself · Architecture
Every year, first responders walk into buildings they've never seen. Search-and-rescue teams navigate rubble without a map. Visually impaired people enter new spaces with no spatial context.
AI could help, if it could actually see.
Today's AI agents are spatially blind. They process pixels and text, but have no concept of where things are in 3D space. The tools that do provide spatial maps (LiDAR rigs, depth cameras, survey hardware) cost tens of thousands of dollars and require trained operators. Spatial intelligence has stayed locked inside robotics labs.
Open Reality breaks that barrier.
Open Reality is a cloud-native spatial AI platform. Describe your task. Point a phone, drop a video, or upload a folder of images. An AI agent maps your space in real time, finds your targets, and answers spatial questions, from any browser, anywhere.
| Mode | How |
|---|---|
| Live Camera | Open a link on any phone browser. No app install. Stream directly. |
| Video File | Upload a recorded walkthrough β MP4, MOV, or any standard format. |
| Image Folder | Drop a folder of frames (e.g. extracted with ffmpeg). Full offline SLAM. |
- Describe your mission: "I'm a paramedic doing a safety sweep."
- Get a spatial plan: the AI agent generates what to look for and how to move through the space.
- Capture the space: live camera, uploaded video, or image folder.
- Walk the space: a dense 3D point cloud builds live as you move.
- The agent finds your targets: objects are detected, pinned in 3D, and available for spatial Q&A.
Open Reality doesn't just map. It reasons about what it sees.
- Intent-driven planning: before a frame is captured, the agent interprets your goal and builds a typed spatial action plan. A firefighter's plan surfaces extinguishers and standpipe connections; a crime scene investigator's plan surfaces evidence markers. The agent understands the difference.
- Continuous detection: as the map grows, the agent automatically scans every new submap for your targets. No user action needed.
- Retroactive re-search: add a new target mid-scan and the agent immediately re-runs detection on everything it has already seen, updating its knowledge of the space backwards in time (see the sketch after this list).
- Spatial Q&A: after the scan, the AI holds the full 3D context. "Is the fire extinguisher accessible from the north stairwell?" gets answered with actual geometry, not a guess.
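To make the retroactive re-search idea concrete, here is a minimal sketch of the pattern, with hypothetical names (the real logic lives in the server and detector modules, not in this snippet):

```python
# Hypothetical sketch of continuous detection + retroactive re-search.
# Class and method names are illustrative, not the project's real API.
from dataclasses import dataclass, field

@dataclass
class Submap:
    submap_id: int
    frames: list  # raw frames kept so old submaps can be re-scanned

@dataclass
class SubmapStore:
    submaps: list[Submap] = field(default_factory=list)
    targets: set[str] = field(default_factory=set)
    detections: list[tuple[int, str]] = field(default_factory=list)

    def detect_objects(self, submap: Submap, queries: set[str]) -> None:
        # Stand-in for the CLIP + SAM3 pass over one submap.
        for query in queries:
            self.detections.append((submap.submap_id, query))

    def add_submap(self, submap: Submap) -> None:
        # Continuous detection: every new submap is scanned on arrival.
        self.submaps.append(submap)
        self.detect_objects(submap, self.targets)

    def add_target(self, query: str) -> None:
        # Retroactive re-search: a new target triggers a re-scan of
        # everything already mapped, not just future submaps.
        self.targets.add(query)
        for submap in self.submaps:
            self.detect_objects(submap, {query})
```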
The inference pipeline (a 1-billion-parameter vision model, CLIP scoring on every submap, SAM3 segmentation) is far too heavy for a laptop and too latency-sensitive for a slow API call. Modal makes real-time spatial AI possible.
| Modal Feature | Why It Matters |
|---|---|
| H100 GPU | Three heavy models in sequence at real-time speed. No dropped frames. |
| Warm Containers | No cold starts between WebSocket frames. Inference hits in under 100 ms per submap. |
| Modal Volumes | VGGT-1B (4 GB) + DINO-Salad weights cached across runs. First inference in seconds, not minutes. |
| Modal Tunnel | Provides the HTTPS endpoint the phone camera requires to stream. Zero SSL config. |
| One Command | `modal deploy modal_streaming.py` gives a stable public URL, live for anyone. |
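As a rough sketch of the wiring (simplified and hedged: this is not the project's actual modal_streaming.py, and some parameter names vary across Modal versions):

```python
# Simplified sketch of the Modal wiring, NOT the actual modal_streaming.py.
# The volume name and idle-timeout parameter are assumptions.
import modal

app = modal.App("vggt-slam-streaming")
weights = modal.Volume.from_name("model-weights", create_if_missing=True)

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "flask", "flask-socketio")
)

@app.function(
    image=image,
    gpu="H100",                     # all three heavy models on one GPU
    volumes={"/weights": weights},  # VGGT-1B + DINO-Salad cached here
    container_idle_timeout=300,     # keep the container warm between frames
)
def run_slam_server():
    ...  # load models from /weights, then serve WebSocket frames
```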
This is Modal the way it's meant to be used: serious inference, at real-time speed, accessible from a link.
Deploy your own instance of Open Reality in one command and get a stable public URL instantly.

```bash
modal deploy modal_streaming.py
```

Modal will output a URL like `https://yourname--vggt-slam-streaming-web.modal.run`. Open it on any device. Point your phone at a room. Watch the 3D map build in real time. Ask the agent to find something.
No app. No hardware. Just a browser and your Modal account.
The same platform, tuned to radically different missions, just by changing the prompt.
| Scenario | Mission Prompt | What the Agent Finds |
|---|---|---|
| Active Fire Scene | "I'm a firefighter doing a sweep of a multi-story office building before full evacuation." | Fire extinguishers, standpipe connections, exit signs, gas shutoff valves, victims |
| Post-Natural Disaster | "I'm a search-and-rescue coordinator mapping a building after an earthquake." | Structural damage, blocked exits, trapped victims, hazardous material spills, load-bearing failures |
| Pre-Listing Property Walkthrough | "I'm a realtor doing a pre-listing inspection of a vacant house." | Scuffed walls, dated fixtures, worn carpet, stained grout, foggy window panes |
| Hotel Room Safety Inspection | "I'm a health inspector auditing a hotel room." | Mold, missing smoke detector components, broken seals, window latch failures |
| Post-Burglary Home Walkthrough | "I'm documenting damage after a break-in for an insurance claim." | Forced entry points, ransacked areas, broken displays, emptied safes |
| Construction Site Safety Audit | "I'm a site safety officer doing a compliance walkthrough." | Exposed rebar, unsecured ladders, open floor penetrations, missing guardrails, blocked exits |
| Flood-Damaged Home Inspection | "I'm a loss adjuster inspecting a home after indoor flooding." | Waterlogged surfaces, swollen drywall, delaminating cabinets, soft subfloor, standing water |
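Under the hood, each row above corresponds to a typed spatial action plan. A hypothetical schema, for illustration only (the agent's real plan format may differ):

```python
# Hypothetical plan schema for illustration; the agent's actual typed
# plan format is defined in the project, not here.
from dataclasses import dataclass

@dataclass
class SpatialPlan:
    mission: str               # the user's prompt, verbatim
    targets: list[str]         # open-set queries fed to CLIP + SAM3
    movement_hints: list[str]  # how to move through the space

plan = SpatialPlan(
    mission="I'm a firefighter doing a sweep of a multi-story office building.",
    targets=["fire extinguisher", "standpipe connection", "exit sign"],
    movement_hints=["cover every stairwell", "sweep each floor perimeter"],
)
```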
The hardware barrier is gone. The deployment barrier is gone. What remains is the mission.
```bash
git clone https://github.com/bdavidzhang/Real-Eyes.git
cd Real-Eyes
conda create -n open-reality python=3.11
conda activate open-reality
chmod +x setup.sh && ./setup.sh
```

The setup script installs all dependencies and clones third-party models (VGGT, DINO-Salad, Perception Encoder, SAM3) into `third_party/`.
```bash
# One-command production deployment: stable URL, always-on
modal deploy modal_streaming.py

# Development mode with auto-reload
modal serve modal_streaming.py

# Pre-cache model weights (optional, speeds up first run)
modal run modal_streaming.py::app.download_models
```

The streaming server keeps an H100 container warm, serves the frontend as static files built at image creation time, and handles concurrent clients via WebSocket.
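For a sense of what the browser sender page does, a minimal Python client might look like this (the event name and payload shape are assumptions, not the server's actual Socket.IO protocol):

```python
# Minimal streaming-client sketch. The "frame" event name and payload
# shape are assumptions, not the server's actual protocol.
import base64
import cv2               # pip install opencv-python
import socketio          # pip install "python-socketio[client]"

sio = socketio.Client()
sio.connect("https://yourname--vggt-slam-streaming-web.modal.run")

cap = cv2.VideoCapture(0)  # default webcam
ok, frame = cap.read()
if ok:
    _, jpg = cv2.imencode(".jpg", frame)
    sio.emit("frame", {"image": base64.b64encode(jpg).decode("ascii")})

cap.release()
sio.disconnect()
```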
```bash
# Process a folder of images on a remote A100
modal run modal_app.py --image-folder ./office_loop --submap-size 16 --max-loops 1
```

Uploads your images, runs full SLAM, and downloads poses + dense point clouds to `./modal_results/`.
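Once results land, the dense point cloud can be inspected with standard tools, for example with Open3D (the output filename below is an assumption; check what the run actually writes):

```python
# Inspect a downloaded point cloud. The filename under ./modal_results/
# is an assumption; check the actual output of your run.
import open3d as o3d  # pip install open3d

pcd = o3d.io.read_point_cloud("modal_results/map.ply")
print(pcd)                                # point count and bounds
o3d.visualization.draw_geometries([pcd])  # interactive viewer
```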
```bash
# Run the streaming server locally
python -m server.app --port 5000

# Or run offline SLAM with Viser visualization (localhost:8080)
python main.py --image_folder /path/to/images --max_loops 1 --vis_map

# Quick test with bundled sample data
unzip office_loop.zip
python main.py --image_folder office_loop --max_loops 1 --vis_map
```

Open Reality accepts three input types:
- Live camera: open the sender page on any phone, no install needed.
- Recorded video: upload directly via the UI, or extract frames manually (see the snippet below).
- Image folder: point `--image_folder` at a directory of frames for full offline SLAM.
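If you prefer to pre-extract frames yourself, a minimal OpenCV version of the ffmpeg step looks like this (paths and sampling stride are placeholders):

```python
# Extract every Nth frame from a walkthrough video into an image folder.
# Paths and the sampling stride are placeholders; ffmpeg works just as well.
import os
import cv2  # pip install opencv-python

os.makedirs("office_loop", exist_ok=True)
cap = cv2.VideoCapture("walkthrough.mp4")
stride, idx, saved = 5, 0, 0  # keep one frame in five

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % stride == 0:
        cv2.imwrite(f"office_loop/{saved:05d}.jpg", frame)
        saved += 1
    idx += 1
cap.release()
```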
| Flag | Default | Description |
|---|---|---|
| `--submap_size` | 16 | Frames per submap batch |
| `--min_disparity` | 50 | Optical flow threshold for keyframe selection |
| `--conf_threshold` | 25 | Filter out the bottom N% lowest-confidence points |
| `--lc_thres` | 0.95 | Loop closure similarity threshold |
| `--max_loops` | 0 | Enable loop closure (0 or 1) |
| `--vis_voxel_size` | none | Downsample the point cloud for visualization |
| `--run_os` | off | Enable open-set 3D object detection |
```text
Phone Camera / Video File / Image Folder
        ↓  WebSocket stream (HTTPS via Modal tunnel)
Keyframe Selection
    Lucas-Kanade optical flow → skip frames without motion
        ↓
VGGT-1B Vision Model
    Predicts dense depth + camera pose per frame
    No GPS. No depth sensor. Just pixels.
        ↓
CLIP + SAM3
    CLIP scores every submap against target queries
    SAM3 segments matches → projects to 3D bounding box
        ↓
GTSAM Pose Graph
    SL(4) manifold optimization
    Loop closure keeps the global map consistent
        ↓
Spatial Agent (Claude / Gemini)
    Interprets user intent, plans searches,
    answers spatial questions with real geometry
```
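For a concrete feel of the first stage, here is a minimal keyframe gate in the same spirit (a sketch, not the project's actual keyframe code; the 50 px threshold mirrors the `--min_disparity` default):

```python
# Keyframe-gate sketch in the spirit of the pipeline, not the project's
# actual code. The 50 px threshold mirrors the --min_disparity default.
import cv2
import numpy as np

def is_keyframe(prev_gray, gray, min_disparity=50.0):
    """Accept a frame only if median optical-flow displacement is large enough."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8)
    if pts is None:
        return True  # nothing to track, keep the frame
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return True
    disparity = np.linalg.norm((nxt - pts)[good].reshape(-1, 2), axis=1)
    return float(np.median(disparity)) >= min_disparity
```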
| Component | Role |
|---|---|
| `Solver` (`solver.py`) | Central coordinator: owns the map, pose graph, retrieval system, and viewer |
| `StreamingSLAM` (`server/streaming_slam.py`) | Wraps `Solver` for frame-by-frame WebSocket streaming |
| `PoseGraph` (`graph.py`) | GTSAM SL(4) manifold optimization with inter-submap and loop closure constraints |
| `ObjectDetector` (`object_detector.py`) | PE-Core CLIP + SAM3 for open-set 3D bounding box detection |
| `SpatialAgent` (`server/spatial_agent.py`) | LLM-powered agent with SLAM-backed tools for spatial reasoning |
| `ImageRetrieval` (`loop_closure.py`) | DINO-Salad descriptors for loop closure detection |
| Frontend (`server/webserver/`) | Vite + TypeScript + Three.js: 3D viewer, camera sender, plan UI, summary page |
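To illustrate the open-set scoring idea behind `ObjectDetector`, here is a sketch using the standard open_clip package as a stand-in for PE-Core CLIP (model name and workflow are assumptions, not the project's code):

```python
# Open-set scoring sketch using open_clip as a stand-in for PE-Core CLIP.
# Model name and workflow are assumptions, not the project's code.
import torch
import open_clip  # pip install open_clip_torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def score_submap(frame_path: str, queries: list[str]) -> dict[str, float]:
    """Cosine similarity between one submap frame and each target query."""
    image = preprocess(Image.open(frame_path)).unsqueeze(0)
    text = tokenizer(queries)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)
    # Queries scoring above a threshold would be handed to SAM3 for
    # segmentation and projection into a 3D bounding box.
    return {q: float(s) for q, s in zip(queries, sims)}
```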
| Page | Purpose |
|---|---|
| `index.html` | Live 3D SLAM viewer with point clouds and detections |
| `sender.html` | Camera/video input; streams frames to the server |
| `plan.html` | Agent plan visualization and mission planning UI |
| `summary.html` | Detection summary: 2D floorplan + 3D overview + spatial Q&A |
Python · Modal · PyTorch · VGGT-1B · GTSAM · CLIP · SAM3 · DINO-Salad · Flask · Socket.IO · Three.js · Vite · TypeScript · Claude · Gemini
Open Reality builds on VGGT-SLAM 2.0 by Dominic Maggio and Luca Carlone at MIT SPARK Lab. We extend their research-grade dense SLAM system into a cloud-native, agentic spatial AI platform: deployed on Modal, accessible from any phone, and powered by autonomous spatial reasoning.
Open Reality · HackIllinois 2026 · Modal Track
Spatial intelligence for anyone, anywhere.