Land eval contract + recording hygiene + Newton integration backend on main #1

Open
amarrmb wants to merge 7 commits into main from merge/eval-and-recording-hygiene

Conversation

@amarrmb (Owner) commented Apr 25, 2026

Why

`experimental/newton-eval` had grown into a usable platform, but none of it had reached `main`. This lands the on-thesis subset — the inspectable experimentation layer the project doc describes — so a new visitor can run a real evaluation end-to-end on `main`.

Per the "why robosandbox exists" thesis, RoboSandbox is the integration layer; bringing your own training framework, your own simulator at scale, your own real robot is the whole point. So this includes:

  • ✅ The eval contract (`eval/stats.py`, structured JSON, Wilson CI, fingerprinting, spatial breakdown)
  • ✅ Workflow plumbing (LeRobot v3 export, processor pipeline support, action_repeat, settle/reload)
  • ✅ Newton as an integration example showing how the same task definition runs against a different sim — same posture as using external LeRobot for training
  • ❌ The PPO RL module — "we don't train in-tree" is explicit in the thesis. Stays on `experimental/newton-eval`.

Commits (7, themed for review)

Commit Theme
`bdab0a9` `feat(eval)`: structured JSON, Wilson CIs, per-trial provenance, spatial breakdown
`da429b5` `feat(policy)`: per-trial reset/reload, action_repeat, LeRobot processor pipeline
`7488fd7` `fix(export)`: LeRobot v3 layout + multi-episode + reject action=None
`39cd089` `fix(scene+sim)`: strip MJCF keyframes; record last_action() on MuJoCo backend
`9414a29` `feat(tasks)`: supported_backends, randomize spec, criterion target-object helper
`472ccae` `feat(sim)`: Newton integration backend (state + opt-in batched RGB) + sim-check
`3fc36be` `feat(cli+viewer)`: eval / compare / sim-check / simulate subcommands

What you can do on `main` after this lands

# Multi-trial eval with statistical CI + provenance
robo-sandbox eval --task pick_cube_franka_random --policy /path/to/lerobot_act_ckpt \
    --sim-backend mujoco --n-trials 64 --output eval_act_50k.json

# Compare two checkpoints (two-proportion z-test)
robo-sandbox compare eval_act_50k.json eval_act_100k.json

# Same eval against the Newton integration backend, GPU-parallel
robo-sandbox eval --task pick_cube_franka_random --policy /path/to/lerobot_act_ckpt \
    --sim-backend newton --world-count 64

# Cross-sim agreement check
robo-sandbox sim-check --task pick_cube_franka --policy runs/<ep>

Test plan

  • `pytest tests/` (excluding the known probabilistic Pick MuJoCo flake): 176 passed, 1 skipped
  • CLI smoke (`robo-sandbox eval --help`, `compare --help`, `sim-check --help`, `simulate --help`) — all four new subcommands wired
  • Import sanity (`from robosandbox import policy, eval, sim, tasks, export, scene`)
  • End-to-end MuJoCo eval with a real ACT checkpoint (run after merge for the public quickstart)
  • End-to-end Newton eval on DGX (already validated on `experimental/newton-eval`)

What stays on `experimental/newton-eval`

  • `robosandbox/rl/` (PPO module — research, not product)
  • `robosandbox/cli.py` `train` subcommand (calls into rl/)
  • DGX-specific session probe scripts (`scripts/probe_*`, `scripts/record_newton_parallel_demo.py`)
  • Newton object-kinds beyond `box` (next sprint on the experimental branch)

🤖 Generated with Claude Code

amarrmb and others added 7 commits April 25, 2026 12:09
… spatial breakdown

Adds robosandbox.eval — the evaluation contract layer the rest of the
project hangs off. summarise_eval() turns per-trial outcomes
into a v2 EvalSummary dict with success rate, Wilson 95% CI, optional
spatial breakdown (cube-pose bins → per-bin success rate), and a
provenance block (checkpoint sha256 + git rev + library versions).
Two-proportion z-test helpers wired in for compare workflows.

Makes "compare two checkpoints under the same task contract" a one-command
operation rather than an ad hoc script per project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
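The Wilson 95% CI mentioned above is a standard construction; here is a generic sketch of the interval itself (not the actual `eval/stats.py` code):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (95% at z=1.96).

    Unlike the naive p +/- z*sqrt(p(1-p)/n) interval, it stays inside
    [0, 1] and behaves sensibly at small n or extreme success rates.
    """
    if n == 0:
        return (0.0, 1.0)  # no trials: the rate is unconstrained
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, centre - half), min(1.0, centre + half))
```

At 32/64 successes this gives roughly (0.38, 0.62): a reminder that even 64 trials leaves a wide interval, which is why the CI belongs in the summary rather than the bare success rate.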
…r pipeline

run_policy:
- success criterion is now LATCHED on first per-step match (not just final
  state) — chunked policies that briefly succeed and then fail still register
- action_repeat: hold each policy action for N sim steps so 30 fps recordings
  can replay correctly in a 200 Hz sim
- single observe per step (the post-step obs becomes next step's input)
- success-detection uses tasks.runner._eval_criterion via a thin shim

LeRobotPolicyAdapter:
- preprocessor / postprocessor pipeline parameters (lerobot 0.4 moved
  per-feature normalization out of the model's forward())
- reset() forwards to the wrapped policy so chunked-action queues
  (ACT, diffusion) clear between trials — without this, eval results
  silently depended on trial order

load_policy:
- LeRobot-checkpoint branch loads the matching processor pipeline pair
- visual-input keys auto-detected from policy.config.input_features so
  legacy diffusion_pusht (observation.image) and ACT (observation.images.<cam>)
  both work without per-checkpoint glue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
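The latched-success and action_repeat behaviour can be illustrated with a minimal loop. `act_fn`, `sim`, and `criterion` below are hypothetical stand-ins for illustration, not the real run_policy signature:

```python
class CountingSim:
    """Toy simulator for illustration: just counts physics steps."""
    def __init__(self):
        self.steps = 0
    def step(self, action):
        self.steps += 1
    def observe(self):
        return self.steps

def run_policy_sketch(act_fn, sim, criterion, obs, n_policy_steps, action_repeat):
    success = False
    for _ in range(n_policy_steps):
        action = act_fn(obs)
        for _ in range(action_repeat):
            sim.step(action)                 # hold the action for N sim steps
        obs = sim.observe()                  # single observe: next step's input
        success = success or criterion(obs)  # latched on first per-step match
    return success
```

With `action_repeat=7` a 30 fps policy drives a ~210 Hz sim, and a criterion that holds only transiently mid-episode still yields success at the end, where a final-state-only check would miss it.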
…rames

lerobot v3 expects parquet shards in a specific subdirectory layout
(data/chunk-NNN/episode-NNNNNN.parquet); the old exporter wrote flat
parquet which `lerobot train` rejected at load time.

Also:
- multi-episode export: pass a parent runs/ dir, get one dataset
- preserve gripper as the last column of action (not in observation.state)
  so policies that emit (n_arm + 1)-dim actions can be trained directly
- reject action=None frames at the column-builder seam — silently dropping
  them was producing datasets with phantom rows where the policy had no
  command to predict

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
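The shard layout described above can be sketched as a path helper. The 1000-episodes-per-chunk constant is an assumption for illustration, not taken from the exporter:

```python
from pathlib import Path

def episode_parquet_path(root: str, episode_index: int,
                         chunk_size: int = 1000) -> Path:
    # LeRobot v3-style layout: data/chunk-NNN/episode-NNNNNN.parquet
    chunk = episode_index // chunk_size
    return (Path(root) / "data"
            / f"chunk-{chunk:03d}"
            / f"episode-{episode_index:06d}.parquet")
```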
… MuJoCo backend

scene/robot_loader: robot MJCFs ship with <keyframe> blocks sized to the
robot in isolation. Once inject_scene_objects adds free-joint objects
(each adds 7 qpos), MuJoCo refuses to compile the model with
"keyframe N: invalid qpos size, expected length M". Strip them at load
time — robot home pose flows through the sidecar (home_qpos) anyway.

sim/mujoco_backend: capture the last (target_joints, gripper) commanded
via step() and expose via last_action(). Lets the recorder populate the
JSONL `action` field uniformly without each skill having to thread the
target through ctx.on_step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
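The keyframe-stripping idea looks roughly like this: a sketch using ElementTree on the raw MJCF string (the real loader may well operate on its own model representation instead):

```python
import xml.etree.ElementTree as ET

def strip_keyframes(mjcf_xml: str) -> str:
    """Drop <keyframe> blocks from an MJCF document.

    Robot MJCFs ship keyframes sized to the robot alone; once free-joint
    objects are injected (7 qpos each), those keyframes no longer match
    the model's qpos size and MuJoCo refuses to compile.
    """
    root = ET.fromstring(mjcf_xml)
    for kf in root.findall("keyframe"):
        root.remove(kf)
    return ET.tostring(root, encoding="unicode")
```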
…ect helper

- task definitions can now declare supported_backends (mujoco/newton/...)
  so the CLI can fail clearly when a task is asked to run on an
  incompatible sim
- pick_cube_franka_random.yaml: a randomize block (cube xyz jitter) — gives
  evals real per-trial variability so success rate is meaningful instead
  of measuring one fixed scene
- tasks/runner: criterion_target_object() centralizes the
  "what cube does this success criterion track" lookup that used to be
  duplicated across the eval CLI and per-trial details collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
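The clear-failure behaviour can be sketched as a guard. The dict-shaped `task` and the function name are illustrative, not the runner's actual API:

```python
def check_backend(task: dict, backend_name: str) -> None:
    # Fail loudly when a task declares supported_backends and the
    # requested sim is not among them.
    supported = task.get("supported_backends")
    if supported and backend_name not in supported:
        raise ValueError(
            f"task {task.get('name', '?')!r} supports {supported}, "
            f"not {backend_name!r}"
        )
```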
…+ sim-check

RoboSandbox is the integration layer; Newton is one of the simulators it
plays with. This lands the Newton integration as a parallel backend choice
behind the same sim API:

- sim/factory: create_sim_backend(name, **kw) — registry + entry-point
  dispatch so the CLI and downstream code stay backend-agnostic
- sim/newton_backend: state-only by default (joint_q + body_q across N
  parallel worlds via newton.ModelBuilder + MuJoCo-Warp solver). Opt-in
  RGB via enable_camera=True wires newton.sensors.SensorTiledCamera to
  raytrace per-world RGB on the GPU. Lazy import of warp/newton so the
  module is safe to import on CPU-only machines.
- eval/sim_check: drive the same policy through MuJoCo and Newton, report
  trajectory agreement — the reproducibility check for any
  cross-sim integration claim.

Importantly: Newton runs require warp + newton ≥ 1.1 in the active env.
RoboSandbox does not bundle them; users opt in via `pip install warp newton`
or use the newton venv on a GPU host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
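The registry half of the `create_sim_backend` dispatch can be sketched as follows (entry-point discovery omitted; names are illustrative, not the factory's actual code):

```python
_BACKENDS: dict[str, type] = {}

def register_backend(name: str):
    """Class decorator: register a sim backend under a string name."""
    def deco(cls):
        _BACKENDS[name] = cls
        return cls
    return deco

def create_sim_backend(name: str, **kwargs):
    """Instantiate a registered backend; callers stay backend-agnostic."""
    try:
        cls = _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown sim backend {name!r}; known: {sorted(_BACKENDS)}"
        ) from None
    return cls(**kwargs)
```

Keeping construction behind a string name is what lets the CLI, the viewer, and downstream code switch simulators with a flag rather than an import.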
…pluggable viewer backend

CLI gains four user-facing eval-loop verbs:

- eval: run a policy against a task; n-trials with deterministic per-trial
  reload, settle-steps, action-repeat, and structured JSON output (the
  EvalSummary v2 shape). Single-world MuJoCo or N parallel Newton worlds
  via --sim-backend / --world-count.
- compare: two-proportion z-test on two prior eval JSONs ("did checkpoint B
  beat checkpoint A?") so you don't eyeball CIs.
- sim-check: drive the same policy through both backends, report
  trajectory agreement — the reproducibility hook for any cross-sim claim.
- simulate: drop a task into a backend and step it; useful for debugging
  scene loading without running a full eval.
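The test behind `compare` is the standard pooled two-proportion z-test; a generic sketch (not the eval/stats.py implementation):

```python
import math

def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test: is success rate 2 different from rate 1?

    Returns (z, two-sided p-value). Normal approximation, so it assumes
    reasonably large trial counts.
    """
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

For example, 32/64 vs 48/64 gives z ≈ 2.92 with p well under 0.01: a clear improvement, stated without eyeballing two CIs.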

viewer/server: route through sim.factory.create_sim_backend so the live
viewer can pick MuJoCo or Newton just like the CLI eval can.

The PPO `train` subcommand is intentionally left off `main` — RoboSandbox
is not a training framework. PPO lives on the experimental/newton-eval
branch as research, not product.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>