An end-to-end pipeline that converts human demonstration videos into simulation-ready assets and physics-grounded robot training data.
Documentation · Robotic Grounding Project Page · Robotic Grounding Tech Report
Video to Data (V2D) turns raw human demonstrations into robot-ready training data through three composable stages. Each stage runs independently and writes its artifacts to disk, so you can stop, inspect, cache, and recompose the pipeline at any boundary.
- Video Ingestion Agent: a LangGraph-driven agentic workflow that segments demonstration videos into temporally bounded action clips, extracts an entity-relation scene graph, and stores per-frame SigLIP-2 embeddings. The result is a queryable action database (`graph.db` + `vector.db`) that lets downstream stages select which clips to process via natural-language retrieval, instead of brute-forcing the full video.
- Reconstruction: containerized vision modules turn the selected RGB (or stereo) clips into per-frame depth, object masks, textured meshes, 6-DoF object poses, and SMPL human body parameters. Multi-view pipelines (`run_mv_hoi_reconstruction`, `run_mv_calibration`) orchestrate the full reconstruction from a rosbag.
- Robotic Grounding: human motion is retargeted onto the target robot embodiment (Sharpa or G1), then the reconstructed scene and retargeted motion drive Isaac Lab environments trained with RL to produce deployable policies.
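
The retrieval idea behind the action database can be illustrated as a nearest-neighbour search over clip embeddings. The sketch below is an illustration only, under stated assumptions: it uses tiny hand-written vectors in place of SigLIP-2 embeddings, and `Clip`, `cosine`, and `top_k` are hypothetical names rather than part of `video_ingestion_agent` or its database schema.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one action clip (not the real schema)."""
    clip_id: str
    start_s: float
    end_s: float
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, clips, k=1):
    """Return the k clips whose embeddings are most similar to the query vector."""
    return sorted(clips, key=lambda c: cosine(query, c.embedding), reverse=True)[:k]


# Toy 3-dimensional embeddings; real embeddings would come from the ingestion stage.
clips = [
    Clip("pick_up_mug", 0.0, 2.5, (0.9, 0.1, 0.0)),
    Clip("open_drawer", 2.5, 6.0, (0.1, 0.9, 0.1)),
    Clip("pour_water", 6.0, 9.0, (0.0, 0.2, 0.9)),
]
query_embedding = (0.8, 0.2, 0.1)  # stand-in for an embedded text query
best = top_k(query_embedding, clips, k=1)
print([c.clip_id for c in best])
```

Only the selected clips (here, their `start_s` and `end_s` bounds) would then need to be passed on to reconstruction, which is the point of retrieving before processing.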
The pipeline in action — from a raw human demonstration, to grounded policies trained in Isaac Lab, to deployment on a physical robot.
| Package | Role | Runtime |
|---|---|---|
| `video_ingestion_agent/` | Video → action segments + entity scene graph + frame embeddings. LangGraph pipeline (segment → verify/refine → entity graph → embeddings) plus an EGAgent-style natural-language retrieval agent and an optional Gradio UI. | Python venv + vLLM server |
| `reconstruction/` | Video → depth, masks, meshes, 6D poses, human body. 18 containerized modules + multi-view pipelines. | Docker (per-module images) |
| `robotic_grounding/` | RL training on NVIDIA Isaac Lab with PPO; motion retargeting utilities. Code will be published in a later release. | Coming soon |
- Docker with GPU support (install)
- NVIDIA Container Toolkit
- Python 3.10+
cd video_ingestion_agent
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[all]" # vLLM, webapp, benchmark, dev tools
# 1. Start the vLLM server (loads the VLM, ~1 minute)
python scripts/serve.py -c configs/ingestion.yaml
# 2. Ingest a video — segmentation → entity graph → report
python scripts/run_ingestion.py path/to/video.mp4 \
-c configs/ingestion.yaml --no-verify -o runs/my_run
# 3. Retrieve clips with natural language
python scripts/run_retrieval.py "Find clips where someone picks up a mug" \
-d outputs/ -c configs/retrieval.yaml
# 4. Or browse interactively in the web UI
python scripts/run_webapp.py

See video_ingestion_agent/README.md for hardware requirements, the full extras list, the verify/refine loop, and batch-ingestion across multiple GPUs. Pre-publication TODOs are tracked in video_ingestion_agent/docs/release_readiness.md.
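
If you want to script the ingestion and retrieval steps above rather than type them by hand, one option is to build the command lines in Python. This is a minimal sketch, not part of the package: `ingestion_cmd` and `retrieval_cmd` are hypothetical helpers, the `demos/` paths are placeholders, and only the flags shown in the commands above are used. It assumes that omitting `--no-verify` enables the verify/refine loop, which the package README should confirm.

```python
import shlex
from pathlib import Path


def ingestion_cmd(video, run_dir, config="configs/ingestion.yaml", verify=False):
    """Build the argv for scripts/run_ingestion.py using the flags shown above."""
    cmd = ["python", "scripts/run_ingestion.py", str(video), "-c", config]
    if not verify:
        # Assumption: leaving this flag out turns the verify/refine loop on.
        cmd.append("--no-verify")
    cmd += ["-o", str(run_dir)]
    return cmd


def retrieval_cmd(query, db_dir, config="configs/retrieval.yaml"):
    """Build the argv for scripts/run_retrieval.py using the flags shown above."""
    return ["python", "scripts/run_retrieval.py", query, "-d", str(db_dir), "-c", config]


# Placeholder inputs: one run directory per video, named after the file stem.
videos = [Path("demos/a.mp4"), Path("demos/b.mp4")]
commands = [ingestion_cmd(v, Path("runs") / v.stem) for v in videos]
commands.append(retrieval_cmd("Find clips where someone picks up a mug", "outputs/"))

for cmd in commands:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

To execute a step, pass the list to `subprocess.run(cmd, check=True)` from inside `video_ingestion_agent/` with the vLLM server already running. Building argument lists instead of shell strings avoids quoting problems with natural-language queries.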
The reconstruction subfolder contains a variety of algorithms and pipelines for human-object reconstruction. For the initial release, we provide an example pipeline for ego-centric hand-object reconstruction — follow the setup instructions here.
Note: The reconstruction subfolder contains a wide variety of packages, many of which are partially tested or in development. You may find these packages useful, but please note they are subject to change. The ego-centric pipeline above has been tested and is officially included as part of the initial Video to Data release. If there is a package you would like to see supported, or you have any feedback, please open an issue on GitHub.
The Robotic Grounding stage (motion retargeting + Isaac Lab RL training) will be publicly available in a later release. See robotic_grounding/README.md for an overview. The tech report and project page are available.
- Host orchestration, containerized inference. The host runs thin Python wrappers that `docker run` each module; all ML dependencies live inside their respective images. No CUDA or PyTorch is ever installed on the host.
- Typed contracts between packages. Modules communicate through strongly typed dataclasses in `v2d_common` (`DepthImage`, `CameraIntrinsics`, `Transform3d`, `BoundingBox`, `Mask`), never raw arrays across package boundaries.
- File-based dataflow. Modules write intermediate artifacts to disk (depth PNGs, pose JSONs, mask PNGs, etc.), enabling independent execution, caching, and pipeline composition via `v2d_pipelines`.
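
The three principles above can be sketched together in a few lines. Everything here is an assumption made for illustration: `BoundingBox2D` is a stand-in whose fields may not match the real `v2d_common.BoundingBox`, the image name and the `--input` argument are placeholders, and `docker_run_argv` and `cached` are hypothetical helpers rather than the `v2d_pipelines` API. Only standard `docker run` flags (`--rm`, `--gpus`, `-v`) are used.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class BoundingBox2D:
    """Hypothetical typed contract; the real v2d_common fields may differ."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def __post_init__(self):
        # Validate at the package boundary instead of passing raw arrays around.
        if self.x_max < self.x_min or self.y_max < self.y_min:
            raise ValueError("bounding box corners are out of order")


def docker_run_argv(image, host_dir, args):
    """Build a `docker run` argv that mounts the artifact directory at /data."""
    return ["docker", "run", "--rm", "--gpus", "all",
            "-v", f"{Path(host_dir).resolve()}:/data", image, *args]


def cached(path, produce):
    """Reuse a JSON artifact already on disk; otherwise produce and persist it."""
    if path.exists():
        return BoundingBox2D(**json.loads(path.read_text()))
    box = produce()
    path.write_text(json.dumps(asdict(box)))
    return box


calls = []


def detect():
    """Stand-in for a containerized module; records each invocation."""
    calls.append(1)
    return BoundingBox2D(10.0, 20.0, 110.0, 220.0)


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "frame_0000_box.json"
    first = cached(artifact, detect)   # runs the producer and writes the file
    second = cached(artifact, detect)  # read back from disk; producer not called
    argv = docker_run_argv("example/detector:latest", tmp, ["--input", "/data"])

print(first == second, len(calls))
print(argv[:5])
```

Because each artifact is an ordinary file, a stage can be skipped whenever its output already exists, and the host process needs nothing beyond the Python standard library and a Docker client.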
See the contributing guide in reconstruction/README.md for adding new reconstruction modules. Each new module must expose a Docker image, a run_download_weights entry point (if weights are required), a run_shell entry point, and a typed API surface consistent with v2d_common.
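
As a rough mental model of that checklist, the requirements can be written down as a Python protocol. This is a hypothetical sketch, not the repository's interface: the actual entry points may be scripts or console commands rather than methods, `run_shell` is assumed here to return a command that opens an interactive shell in the module's container, and `ReconstructionModule`, `ExampleDepthModule`, and `missing_entry_points` are invented names.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class ReconstructionModule(Protocol):
    """Hypothetical contract mirroring the checklist above."""
    image: str  # Docker image that carries the module's ML dependencies

    def run_download_weights(self, weights_dir: Path) -> None: ...

    def run_shell(self) -> list[str]: ...


@dataclass
class ExampleDepthModule:
    """Toy module used only to exercise the protocol; the image name is a placeholder."""
    image: str = "example/depth:latest"

    def run_download_weights(self, weights_dir: Path) -> None:
        # A real module would fetch weights here; this only creates the directory.
        weights_dir.mkdir(parents=True, exist_ok=True)

    def run_shell(self) -> list[str]:
        # Assumed meaning: an interactive shell inside the module's container.
        return ["docker", "run", "--rm", "-it", self.image, "bash"]


def missing_entry_points(module):
    """List the required attributes that a candidate module does not provide."""
    required = ("image", "run_download_weights", "run_shell")
    return [name for name in required if not hasattr(module, name)]


module = ExampleDepthModule()
print(isinstance(module, ReconstructionModule))
print(missing_entry_points(module))
print(missing_entry_points(object()))
```

A structural check like this only confirms that the names exist; it says nothing about whether the typed inputs and outputs match `v2d_common`, which still needs review against the contributing guide.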



