Land eval contract + recording hygiene + Newton integration backend on main #1

Open
amarrmb wants to merge 7 commits into main from merge/eval-and-recording-hygiene

Conversation

@amarrmb (Owner) commented Apr 25, 2026

Why

`experimental/newton-eval` had grown into a usable platform, but none of it had reached `main`. This lands the on-thesis subset — the inspectable experimentation layer the project doc describes — so a new visitor can run a real evaluation end-to-end on `main`.

Per the "why robosandbox exists" thesis, RoboSandbox is the integration layer; bringing your own training framework, your own simulator at scale, your own real robot is the whole point. So this includes:

  • ✅ The eval contract (`eval/stats.py`, structured JSON, Wilson CI, fingerprinting, spatial breakdown)
  • ✅ Workflow plumbing (LeRobot v3 export, processor pipeline support, action_repeat, settle/reload)
  • ✅ Newton as an integration example showing how the same task definition runs against a different sim — same posture as using external LeRobot for training
  • ❌ The PPO RL module — "we don't train in-tree" is explicit in the thesis. Stays on `experimental/newton-eval`.

Commits (7, themed for review)

Commit Theme
`bdab0a9` `feat(eval)`: structured JSON, Wilson CIs, per-trial provenance, spatial breakdown
`da429b5` `feat(policy)`: per-trial reset/reload, action_repeat, LeRobot processor pipeline
`7488fd7` `fix(export)`: LeRobot v3 layout + multi-episode + reject action=None
`39cd089` `fix(scene+sim)`: strip MJCF keyframes; record last_action() on MuJoCo backend
`9414a29` `feat(tasks)`: supported_backends, randomize spec, criterion target-object helper
`472ccae` `feat(sim)`: Newton integration backend (state + opt-in batched RGB) + sim-check
`3fc36be` `feat(cli+viewer)`: eval / compare / sim-check / simulate subcommands

What you can do on `main` after this lands

# Multi-trial eval with statistical CI + provenance
robo-sandbox eval --task pick_cube_franka_random --policy /path/to/lerobot_act_ckpt \
    --sim-backend mujoco --n-trials 64 --output eval_act_50k.json

# Compare two checkpoints (two-proportion z-test)
robo-sandbox compare eval_act_50k.json eval_act_100k.json

# Same eval against the Newton integration backend, GPU-parallel
robo-sandbox eval --task pick_cube_franka_random --policy /path/to/lerobot_act_ckpt \
    --sim-backend newton --world-count 64

# Cross-sim agreement check
robo-sandbox sim-check --task pick_cube_franka --policy runs/<ep>

Test plan

  • `pytest tests/` (excluding the known probabilistic Pick MuJoCo flake): 176 passed, 1 skipped
  • CLI smoke (`robo-sandbox eval --help`, `compare --help`, `sim-check --help`, `simulate --help`) — all four new subcommands wired
  • Import sanity (`from robosandbox import policy, eval, sim, tasks, export, scene`)
  • End-to-end MuJoCo eval with a real ACT checkpoint (run after merge for the public quickstart)
  • End-to-end Newton eval on DGX (already validated on `experimental/newton-eval`)

What stays on `experimental/newton-eval`

  • `robosandbox/rl/` (PPO module — research, not product)
  • `robosandbox/cli.py` `train` subcommand (calls into rl/)
  • DGX-specific session probe scripts (`scripts/probe_*`, `scripts/record_newton_parallel_demo.py`)
  • Newton object-kinds beyond `box` (next sprint on the experimental branch)

🤖 Generated with Claude Code

amarrmb and others added 7 commits April 25, 2026 12:09
… spatial breakdown

Adds robosandbox.eval — the evaluation contract layer the rest of the
project hangs off. summarise_eval() turns per-trial outcomes
into a v2 EvalSummary dict with success rate, Wilson 95% CI, optional
spatial breakdown (cube-pose bins → per-bin success rate), and a
provenance block (checkpoint sha256 + git rev + library versions).
Two-proportion z-test helpers wired in for compare workflows.

Makes "compare two checkpoints under the same task contract" a one-command
operation rather than an ad hoc script per project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
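The Wilson 95% CI mentioned above is a standard construction; here is a generic sketch of the interval itself (not the actual `eval/stats.py` code):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (95% at z=1.96).

    Unlike the naive p +/- z*sqrt(p(1-p)/n) interval, it stays inside
    [0, 1] and behaves sensibly at small n or extreme success rates.
    """
    if n == 0:
        return (0.0, 1.0)  # no trials: the rate is unconstrained
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, centre - half), min(1.0, centre + half))
```

At 32/64 successes this gives roughly (0.38, 0.62): a reminder that even 64 trials leaves a wide interval, which is why the CI belongs in the summary rather than the bare success rate.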
…r pipeline

run_policy:
- success criterion is now LATCHED on first per-step match (not just final
  state) — chunked policies that briefly succeed and then fail still register
- action_repeat: hold each policy action for N sim steps so 30 fps recordings
  can replay correctly in a 200 Hz sim
- single observe per step (the post-step obs becomes next step's input)
- success-detection uses tasks.runner._eval_criterion via a thin shim

LeRobotPolicyAdapter:
- preprocessor / postprocessor pipeline parameters (lerobot 0.4 moved
  per-feature normalization out of the model's forward())
- reset() forwards to the wrapped policy so chunked-action queues
  (ACT, diffusion) clear between trials — without this, eval results
  silently depended on trial order

load_policy:
- LeRobot-checkpoint branch loads the matching processor pipeline pair
- visual-input keys auto-detected from policy.config.input_features so
  legacy diffusion_pusht (observation.image) and ACT (observation.images.<cam>)
  both work without per-checkpoint glue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
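The latched-success and action_repeat behaviour can be illustrated with a minimal loop. `act_fn`, `sim`, and `criterion` below are hypothetical stand-ins for illustration, not the real run_policy signature:

```python
class CountingSim:
    """Toy simulator for illustration: just counts physics steps."""
    def __init__(self):
        self.steps = 0
    def step(self, action):
        self.steps += 1
    def observe(self):
        return self.steps

def run_policy_sketch(act_fn, sim, criterion, obs, n_policy_steps, action_repeat):
    success = False
    for _ in range(n_policy_steps):
        action = act_fn(obs)
        for _ in range(action_repeat):
            sim.step(action)                 # hold the action for N sim steps
        obs = sim.observe()                  # single observe: next step's input
        success = success or criterion(obs)  # latched on first per-step match
    return success
```

With `action_repeat=7` a 30 fps policy drives a ~210 Hz sim, and a criterion that holds only transiently mid-episode still yields success at the end, where a final-state-only check would miss it.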
…rames

lerobot v3 expects parquet shards in a specific subdirectory layout
(data/chunk-NNN/episode-NNNNNN.parquet); the old exporter wrote flat
parquet which `lerobot train` rejected at load time.

Also:
- multi-episode export: pass a parent runs/ dir, get one dataset
- preserve gripper as the last column of action (not in observation.state)
  so policies that emit (n_arm + 1)-dim actions can be trained directly
- reject action=None frames at the column-builder seam — silently dropping
  them was producing datasets with phantom rows where the policy had no
  command to predict

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
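The shard layout described above can be sketched as a path helper. The 1000-episodes-per-chunk constant is an assumption for illustration, not taken from the exporter:

```python
from pathlib import Path

def episode_parquet_path(root: str, episode_index: int,
                         chunk_size: int = 1000) -> Path:
    # LeRobot v3-style layout: data/chunk-NNN/episode-NNNNNN.parquet
    chunk = episode_index // chunk_size
    return (Path(root) / "data"
            / f"chunk-{chunk:03d}"
            / f"episode-{episode_index:06d}.parquet")
```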
… MuJoCo backend

scene/robot_loader: robot MJCFs ship with <keyframe> blocks sized to the
robot in isolation. Once inject_scene_objects adds free-joint objects
(each adds 7 qpos), MuJoCo refuses to compile the model with
"keyframe N: invalid qpos size, expected length M". Strip them at load
time — robot home pose flows through the sidecar (home_qpos) anyway.

sim/mujoco_backend: capture the last (target_joints, gripper) commanded
via step() and expose via last_action(). Lets the recorder populate the
JSONL `action` field uniformly without each skill having to thread the
target through ctx.on_step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
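The keyframe-stripping idea looks roughly like this: a sketch using ElementTree on the raw MJCF string (the real loader may well operate on its own model representation instead):

```python
import xml.etree.ElementTree as ET

def strip_keyframes(mjcf_xml: str) -> str:
    """Drop <keyframe> blocks from an MJCF document.

    Robot MJCFs ship keyframes sized to the robot alone; once free-joint
    objects are injected (7 qpos each), those keyframes no longer match
    the model's qpos size and MuJoCo refuses to compile.
    """
    root = ET.fromstring(mjcf_xml)
    for kf in root.findall("keyframe"):
        root.remove(kf)
    return ET.tostring(root, encoding="unicode")
```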
…ect helper

- task definitions can now declare supported_backends (mujoco/newton/...)
  so the CLI can fail clearly when a task is asked to run on an
  incompatible sim
- pick_cube_franka_random.yaml: a randomize block (cube xyz jitter) — gives
  evals real per-trial variability so success rate is meaningful instead
  of measuring one fixed scene
- tasks/runner: criterion_target_object() centralizes the
  "what cube does this success criterion track" lookup that used to be
  duplicated across the eval CLI and per-trial details collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
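The clear-failure behaviour can be sketched as a guard. The dict-shaped `task` and the function name are illustrative, not the runner's actual API:

```python
def check_backend(task: dict, backend_name: str) -> None:
    # Fail loudly when a task declares supported_backends and the
    # requested sim is not among them.
    supported = task.get("supported_backends")
    if supported and backend_name not in supported:
        raise ValueError(
            f"task {task.get('name', '?')!r} supports {supported}, "
            f"not {backend_name!r}"
        )
```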
…+ sim-check

RoboSandbox is the integration layer; Newton is one of the simulators it
plays with. This lands the Newton integration as a parallel backend choice
behind the same sim API:

- sim/factory: create_sim_backend(name, **kw) — registry + entry-point
  dispatch so the CLI and downstream code stay backend-agnostic
- sim/newton_backend: state-only by default (joint_q + body_q across N
  parallel worlds via newton.ModelBuilder + MuJoCo-Warp solver). Opt-in
  RGB via enable_camera=True wires newton.sensors.SensorTiledCamera to
  raytrace per-world RGB on the GPU. Lazy import of warp/newton so the
  module is safe to import on CPU-only machines.
- eval/sim_check: drive the same policy through MuJoCo and Newton, report
  trajectory agreement — the reproducibility check for any
  cross-sim integration claim.

Importantly: Newton runs require warp + newton ≥ 1.1 in the active env.
RoboSandbox does not bundle them; users opt in via `pip install warp newton`
or use the newton venv on a GPU host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
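The registry half of the `create_sim_backend` dispatch can be sketched as follows (entry-point discovery omitted; names are illustrative, not the factory's actual code):

```python
_BACKENDS: dict[str, type] = {}

def register_backend(name: str):
    """Class decorator: register a sim backend under a string name."""
    def deco(cls):
        _BACKENDS[name] = cls
        return cls
    return deco

def create_sim_backend(name: str, **kwargs):
    """Instantiate a registered backend; callers stay backend-agnostic."""
    try:
        cls = _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown sim backend {name!r}; known: {sorted(_BACKENDS)}"
        ) from None
    return cls(**kwargs)
```

Keeping construction behind a string name is what lets the CLI, the viewer, and downstream code switch simulators with a flag rather than an import.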
…pluggable viewer backend

CLI gains four user-facing eval-loop verbs:

- eval: run a policy against a task; n-trials with deterministic per-trial
  reload, settle-steps, action-repeat, and structured JSON output (the
  EvalSummary v2 shape). Single-world MuJoCo or N parallel Newton worlds
  via --sim-backend / --world-count.
- compare: two-proportion z-test on two prior eval JSONs ("did checkpoint B
  beat checkpoint A?") so you don't eyeball CIs.
- sim-check: drive the same policy through both backends, report
  trajectory agreement — the reproducibility hook for any cross-sim claim.
- simulate: drop a task into a backend and step it; useful for debugging
  scene loading without running a full eval.
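The test behind `compare` is the standard pooled two-proportion z-test; a generic sketch (not the eval/stats.py implementation):

```python
import math

def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test: is success rate 2 different from rate 1?

    Returns (z, two-sided p-value). Normal approximation, so it assumes
    reasonably large trial counts.
    """
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

For example, 32/64 vs 48/64 gives z ≈ 2.92 with p well under 0.01: a clear improvement, stated without eyeballing two CIs.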

viewer/server: route through sim.factory.create_sim_backend so the live
viewer can pick MuJoCo or Newton just like the CLI eval can.

The PPO `train` subcommand is intentionally left off `main` — RoboSandbox
is not a training framework. PPO lives on the experimental/newton-eval
branch as research, not product.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>