Land eval contract + recording hygiene + Newton integration backend on main #1
Open
amarrmb wants to merge 7 commits into main
Conversation
… spatial breakdown

Adds robosandbox.eval — the evaluation contract layer the rest of the project hangs evaluation off of. summarise_eval() turns per-trial outcomes into a v2 EvalSummary dict with success rate, Wilson 95% CI, optional spatial breakdown (cube-pose bins → per-bin success rate), and a provenance block (checkpoint sha256 + git rev + library versions). Two-proportion z-test helpers wired in for compare workflows.

Makes "compare two checkpoints under the same task contract" a one-command operation rather than an ad hoc script per project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
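For reviewers who want the statistic without opening the diff: the Wilson interval is the standard score-based CI for a binomial proportion. A minimal self-contained sketch; the function name and signature here are illustrative, not the actual robosandbox.eval API:

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 14/20 successes -> roughly (0.48, 0.85): wide on purpose at small n
print(wilson_ci(14, 20))
```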
…r pipeline

run_policy:
- success criterion is now LATCHED on first per-step match (not just final state) — chunked policies that briefly succeed and then fail still register
- action_repeat: hold each policy action for N sim steps so 30 fps recordings can replay correctly in a 200 Hz sim
- single observe per step (the post-step obs becomes next step's input)
- success detection uses tasks.runner._eval_criterion via a thin shim

LeRobotPolicyAdapter:
- preprocessor / postprocessor pipeline parameters (lerobot 0.4 moved per-feature normalization out of the model's forward())
- reset() forwards to the wrapped policy so chunked-action queues (ACT, diffusion) clear between trials — without this, eval results silently depended on trial order

load_policy:
- LeRobot-checkpoint branch loads the matching processor pipeline pair
- visual-input keys auto-detected from policy.config.input_features so legacy diffusion_pusht (observation.image) and ACT (observation.images.<cam>) both work without per-checkpoint glue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
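The latched-success plus action-repeat semantics boil down to a loop of roughly this shape. A sketch under assumed interfaces: sim.observe, sim.step, and the criterion callable are placeholders, not the real runner API:

```python
def run_trial(policy, sim, criterion, max_steps: int, action_repeat: int = 1) -> bool:
    """One eval trial with a latched success criterion."""
    succeeded = False                    # latched: set once, never cleared
    obs = sim.observe()
    for _ in range(max_steps):
        action = policy.select_action(obs)
        for _ in range(action_repeat):   # hold the action for N sim steps,
            sim.step(action)             # e.g. a 30 fps policy in a 200 Hz sim
        obs = sim.observe()              # single observe: the post-step obs
                                         # becomes next step's policy input
        if criterion(sim):               # a brief success still registers,
            succeeded = True             # even if a chunked policy later fails
    return succeeded
```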
…rames

lerobot v3 expects parquet shards in a specific subdirectory layout (data/chunk-NNN/episode-NNNNNN.parquet); the old exporter wrote flat parquet which `lerobot train` rejected at load time.

Also:
- multi-episode export: pass a parent runs/ dir, get one dataset
- preserve gripper as the last column of action (not in observation.state) so policies that emit (n_arm + 1)-dim actions can be trained directly
- reject action=None frames at the column-builder seam — silently dropping them was producing datasets with phantom rows where the policy had no command to predict

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
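A sketch of the path scheme and the action=None guard. The chunk-size constant and the frame layout (an assumed (arm joints, gripper) pair under "action") are illustration-only assumptions; the directory pattern itself is the one quoted above:

```python
from pathlib import Path

EPISODES_PER_CHUNK = 1000  # assumption; not necessarily the exporter's value

def episode_parquet_path(root: Path, episode_index: int) -> Path:
    """v3 layout: data/chunk-NNN/episode-NNNNNN.parquet under the dataset root."""
    chunk = episode_index // EPISODES_PER_CHUNK
    return root / "data" / f"chunk-{chunk:03d}" / f"episode-{episode_index:06d}.parquet"

def build_action_rows(frames: list[dict]) -> list[list[float]]:
    rows = []
    for i, frame in enumerate(frames):
        if frame.get("action") is None:
            # reject instead of silently dropping: a dropped frame leaves a
            # phantom row with no command for the policy to predict
            raise ValueError(f"frame {i} has action=None")
        arm, gripper = frame["action"]    # assumed (joints, gripper) pair
        rows.append([*arm, gripper])      # gripper rides as the LAST action column
    return rows
```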
… MuJoCo backend

scene/robot_loader: robot MJCFs ship with <keyframe> blocks sized to the robot in isolation. Once inject_scene_objects adds free-joint objects (each adds 7 qpos), MuJoCo refuses to compile the model with "keyframe N: invalid qpos size, expected length M". Strip them at load time — robot home pose flows through the sidecar (home_qpos) anyway.

sim/mujoco_backend: capture the last (target_joints, gripper) commanded via step() and expose via last_action(). Lets the recorder populate the JSONL `action` field uniformly without each skill having to thread the target through ctx.on_step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
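One plausible shape for the keyframe strip, assuming the loader can operate on the MJCF XML text before handing it to MuJoCo (the real loader may use a different mechanism):

```python
import xml.etree.ElementTree as ET

def strip_keyframes(mjcf_xml: str) -> str:
    """Drop <keyframe> blocks sized to the bare robot. After free-joint
    objects are injected (7 qpos each), stale keyframes make MuJoCo fail
    compilation with "keyframe N: invalid qpos size, expected length M"."""
    root = ET.fromstring(mjcf_xml)
    for keyframe in root.findall("keyframe"):  # direct children of <mujoco>
        root.remove(keyframe)
    return ET.tostring(root, encoding="unicode")
```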
…ect helper

- task definitions can now declare supported_backends (mujoco/newton/...) so the CLI can fail clearly when a task is asked to run on an incompatible sim
- pick_cube_franka_random.yaml: a randomize block (cube xyz jitter) — gives evals real per-trial variability so success rate is meaningful instead of measuring one fixed scene
- tasks/runner: criterion_target_object() centralizes the "what cube does this success criterion track" lookup that used to be duplicated across the eval CLI and per-trial details collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
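The supported_backends guard is essentially a membership check before the sim is constructed. A hypothetical sketch; the task-dict layout and error wording are assumptions:

```python
def check_backend(task: dict, backend: str) -> None:
    """Fail clearly when a task declares supported_backends and the
    requested sim isn't one of them; no declaration means no restriction."""
    supported = task.get("supported_backends")
    if supported and backend not in supported:
        raise SystemExit(
            f"task {task.get('name', '<unnamed>')!r} supports {supported}, "
            f"not --sim-backend={backend}"
        )
```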
…+ sim-check

RoboSandbox is the integration layer; Newton is one of the simulators it plays with. This lands the Newton integration as a parallel backend choice behind the same sim API:

- sim/factory: create_sim_backend(name, **kw) — registry + entry-point dispatch so the CLI and downstream code stay backend-agnostic
- sim/newton_backend: state-only by default (joint_q + body_q across N parallel worlds via newton.ModelBuilder + MuJoCo-Warp solver). Opt-in RGB via enable_camera=True wires newton.sensors.SensorTiledCamera to raytrace per-world RGB on the GPU. Lazy import of warp/newton so the module is safe to import on CPU-only machines.
- eval/sim_check: drive the same policy through MuJoCo and Newton, report trajectory agreement — the reproducibility check for any cross-sim integration claim.

Importantly: Newton runs require warp + newton ≥ 1.1 in the active env. RoboSandbox does not bundle them; users opt in via `pip install warp newton` or use the newton venv on a GPU host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
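A minimal sketch of the registry-plus-lazy-import pattern described above. The backend class names and module paths are assumptions, and entry-point scanning is elided:

```python
from importlib import import_module

# name -> "module:attr"; an entry-point scan could extend this at runtime
_BACKENDS = {
    "mujoco": "robosandbox.sim.mujoco_backend:MujocoBackend",
    "newton": "robosandbox.sim.newton_backend:NewtonBackend",
}

def create_sim_backend(name: str, **kwargs):
    """Dispatch by name so the CLI stays backend-agnostic. Importing lazily
    keeps warp/newton off the import path on CPU-only machines."""
    try:
        target = _BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown sim backend {name!r}, known: {sorted(_BACKENDS)}")
    module_name, _, attr = target.partition(":")
    return getattr(import_module(module_name), attr)(**kwargs)
```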
…pluggable viewer backend
CLI gains four user-facing eval-loop verbs:
- eval: run a policy against a task; n-trials with deterministic per-trial
reload, settle-steps, action-repeat, and structured JSON output (the
EvalSummary v2 shape). Single-world MuJoCo or N parallel Newton worlds
via --sim-backend / --world-count.
- compare: two-proportion z-test on two prior eval JSONs ("did checkpoint B
beat checkpoint A?") so you don't eyeball CIs; see the z-test sketch after
this commit message.
- sim-check: drive the same policy through both backends, report
trajectory agreement — the reproducibility hook for any cross-sim claim.
- simulate: drop a task into a backend and step it; useful for debugging
scene loading without running a full eval.
viewer/server: route through sim.factory.create_sim_backend so the live
viewer can pick MuJoCo or Newton just like the CLI eval can.
The PPO `train` subcommand is intentionally left off `main` — RoboSandbox
is not a training framework. PPO lives on the experimental/newton-eval
branch as research, not product.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
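As referenced in the compare bullet above, the statistic is the standard pooled two-proportion z-test. A self-contained sketch; the real helper lives in robosandbox.eval and its signature may differ:

```python
import math

def two_proportion_z(s_a: int, n_a: int, s_b: int, n_b: int) -> float:
    """z-statistic for H0: checkpoints A and B share one true success rate."""
    p_pool = (s_a + s_b) / (n_a + n_b)   # pooled success rate under H0
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    return ((s_b / n_b) - (s_a / n_a)) / se

# |z| > 1.96 is significant at the two-sided 5% level;
# e.g. 12/50 vs 24/50 gives z = 2.5
print(two_proportion_z(12, 50, 24, 50))
```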
Why
`experimental/newton-eval` had grown into a usable platform, but none of it had reached `main`. This lands the on-thesis subset — the inspectable experimentation layer the project doc describes — so a new visitor can run a real evaluation end-to-end on `main`.
Per the "why robosandbox exists" thesis, RoboSandbox is the integration layer; bringing your own training framework, your own simulator at scale, your own real robot is the whole point. So this includes:
Commits (7, themed for review)
What you can do on `main` after this lands
Test plan
What stays on `experimental/newton-eval`
🤖 Generated with Claude Code