Video to Data (V2D)

An end-to-end pipeline that converts human demonstration videos into simulation-ready assets and physics-grounded robot training data.

Documentation · Robotic Grounding Project Page · Robotic Grounding Tech Report

Overview

Video to Data (V2D) turns raw human demonstrations into robot-ready training data through three composable stages. Each stage runs independently and writes its artifacts to disk, so you can stop, inspect, cache, and recompose the pipeline at any boundary.

Video Ingestion Agent — a LangGraph-driven agentic workflow that segments demonstration videos into temporally-bounded action clips, extracts an entity-relation scene graph, and stores per-frame SigLIP-2 embeddings. The result is a queryable action database (graph.db + vector.db) that lets downstream stages select which clips to process via natural-language retrieval, instead of brute-forcing the full video.
Reconstruction — containerized vision modules turn the selected RGB (or stereo) clips into per-frame depth, object masks, textured meshes, 6-DoF object poses, and SMPL human body parameters. Multi-view pipelines (run_mv_hoi_reconstruction, run_mv_calibration) orchestrate the full reconstruction from a rosbag.
Robotic Grounding — human motion is retargeted onto the target robot embodiment (Sharpa or G1), then the reconstructed scene and retargeted motion drive Isaac Lab environments trained with RL to produce deployable policies.

Demos

The pipeline in action — from a raw human demonstration, to grounded policies trained in Isaac Lab, to deployment on a physical robot.

Packages

Package	Role	Runtime
`video_ingestion_agent/`	Video → action segments + entity scene graph + frame embeddings. LangGraph pipeline (segment → verify/refine → entity graph → embeddings) plus an EGAgent-style natural-language retrieval agent and an optional Gradio UI.	Python venv + vLLM server
`reconstruction/`	Video → depth, masks, meshes, 6D poses, human body. 18 containerized modules + multi-view pipelines.	Docker (per-module images)
`robotic_grounding/`	RL training on NVIDIA Isaac Lab with PPO; motion retargeting utilities. Code will be published in a later release.	Coming soon

Prerequisites

Docker with GPU support (install)
NVIDIA Container Toolkit
Python 3.10+

Quickstart

Video Ingestion Agent (video → queryable action database)

cd video_ingestion_agent

uv venv .venv && source .venv/bin/activate
uv pip install -e ".[all]"     # vLLM, webapp, benchmark, dev tools

# 1. Start the vLLM server (loads the VLM, ~1 minute)
python scripts/serve.py -c configs/ingestion.yaml

# 2. Ingest a video — segmentation → entity graph → report
python scripts/run_ingestion.py path/to/video.mp4 \
  -c configs/ingestion.yaml --no-verify -o runs/my_run

# 3. Retrieve clips with natural language
python scripts/run_retrieval.py "Find clips where someone picks up a mug" \
  -d outputs/ -c configs/retrieval.yaml

# 4. Or browse interactively in the web UI
python scripts/run_webapp.py

See video_ingestion_agent/README.md for hardware requirements, the full extras list, the verify/refine loop, and batch-ingestion across multiple GPUs. Pre-publication TODOs are tracked in video_ingestion_agent/docs/release_readiness.md.

Reconstruction (video → reconstructed trajectory and simulation assets)

The reconstruction subfolder contains a variety of algorithms and pipelines for human-object reconstruction. For the initial release, we provide an example pipeline for ego-centric hand-object reconstruction — follow the setup instructions here.

Note: The reconstruction subfolder contains a wide variety of packages, many of which are partially tested or in development. You may find these packages useful, but please note they are subject to change. The ego-centric pipeline above has been tested and is officially included as part of the initial Video to Data release. If there is a package you would like to see supported, or you have any feedback, please open an issue on GitHub.

Robotic Grounding (reconstructed trajectory → RL policy and dataset)

The Robotic Grounding stage (motion retargeting + Isaac Lab RL training) will be publicly available in a later release. See robotic_grounding/README.md for an overview. The tech report and project page are available.

Design philosophy

Host orchestration, containerized inference. The host runs thin Python wrappers that docker run each module; all ML dependencies live inside their respective images. No CUDA or PyTorch is ever installed on the host.
Typed contracts between packages. Modules communicate through strongly-typed dataclasses in v2d_common (DepthImage, CameraIntrinsics, Transform3d, BoundingBox, Mask) — never raw arrays across package boundaries.
File-based dataflow. Modules write intermediate artifacts to disk (depth PNGs, pose JSONs, mask PNGs, etc.), enabling independent execution, caching, and pipeline composition via v2d_pipelines.

Contributing

See the contributing guide in reconstruction/README.md for adding new reconstruction modules. Each new module must expose a Docker image, a run_download_weights entry point (if weights are required), a run_shell entry point, and a typed API surface consistent with v2d_common.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.claude/skills		.claude/skills
.github		.github
docs		docs
reconstruction		reconstruction
robotic_grounding		robotic_grounding
video_ingestion_agent		video_ingestion_agent
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Video to Data (V2D)

Contents

Overview

Demos

Packages

Prerequisites

Quickstart

Video Ingestion Agent (video → queryable action database)

Reconstruction (video → reconstructed trajectory and simulation assets)

Robotic Grounding (reconstructed trajectory → RL policy and dataset)

Design philosophy

Contributing

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Video to Data (V2D)

Contents

Overview

Demos

Packages

Prerequisites

Quickstart

Video Ingestion Agent (video → queryable action database)

Reconstruction (video → reconstructed trajectory and simulation assets)

Robotic Grounding (reconstructed trajectory → RL policy and dataset)

Design philosophy

Contributing

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages