feat: Phase 5 MuJoCo Playground PPO benchmarks (54 envs) #534
Closed
Conversation
Integrates 54 GPU-accelerated JAX/MJX environments via the `playground/` prefix:
- PlaygroundVecEnv wrapping wrap_for_brax_training (VmapWrapper + EpisodeWrapper + AutoReset)
- PlaygroundGPUEnv subclass for the zero-copy JAX→PyTorch DLPack path
- 37,855 fps on CPU (10-100x faster on GPU); all 326 tests pass
- Specs: 25 DM Control + 19 Locomotion + 10 Manipulation environments
- Optional dep: `uv sync --group playground`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
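The zero-copy DLPack path mentioned above can be sketched as follows. This is an illustrative stand-in, not SLM-Lab's actual code: any array implementing `__dlpack__` (here a NumPy array standing in for a JAX device array) can be handed to PyTorch without copying the underlying buffer.

```python
import numpy as np
import torch

def to_torch_dlpack(array) -> torch.Tensor:
    # hypothetical helper: zero-copy handoff via the DLPack protocol
    return torch.from_dlpack(array)

obs = np.arange(6, dtype=np.float32).reshape(2, 3)
t = to_torch_dlpack(obs)
t[0, 0] = 42.0            # mutating the tensor is visible in the source array,
assert obs[0, 0] == 42.0  # because both views share the same memory
```

On GPU the same call maps a JAX CUDA buffer straight into a CUDA tensor, which is what makes the path zero-copy rather than a host round-trip.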
JAX/MJX envs aren't gym-registered; util.parallelize(gym.make) spawns 400 processes across 4 sessions, deadlocking sessions 1 and 3 on GPU init.
PlaygroundVecEnv.render() extracts the env[0] state and calls base_env.render(). PlaygroundRenderWrapper drives a pygame window per step (--render flag support).
- replay.py: store CUDA tensor obs directly (no cpu().numpy() roundtrip)
- ml_util.py: batch_get stacks tensors; to_torch_batch skips numpy for existing tensors
- wrappers.py: TorchNormalizeObservation (Welford, lazy-init, device-agnostic)
- __init__.py: wire TorchNormalizeObservation for playground GPU mode
The numpy path is unchanged. Tensor vs numpy is auto-detected; no config flags needed.
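TorchNormalizeObservation uses Welford's online algorithm for running statistics. A minimal scalar sketch of the update rule (a hypothetical standalone class, not the wrapper itself):

```python
class RunningNorm:
    """Welford's online mean/variance: the per-element update behind
    a lazy-init observation normalizer (scalar sketch)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # second factor uses the updated mean

    def normalize(self, x: float, eps: float = 1e-8) -> float:
        var = self.m2 / self.n if self.n else 1.0
        return (x - self.mean) / (var + eps) ** 0.5

rn = RunningNorm()
for x in [1.0, 2.0, 3.0, 4.0]:
    rn.update(x)
assert abs(rn.mean - 2.5) < 1e-9          # running mean
assert abs(rn.m2 / rn.n - 1.25) < 1e-9    # population variance
```

The same update applies element-wise to tensors, which is why the wrapper can stay device-agnostic: every operation is plain arithmetic on whatever device the obs already lives on.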
- Convert playground specs from JSON to YAML+TorchArc (benchmark_arc/playground/)
- ppo_playground_arc.yaml: [64,64] tanh, ${env}/${max_frame} substitution, device=cuda
- sac_playground_arc.yaml: [256,256] relu, normalize_obs, device=cuda
- Delete 4 old JSON specs and benchmark/playground/ directory
- Unify dstack YAML: remove run-gpu-playground.yml, add PLAYGROUND env var to
run-gpu-train.yml with conditional uv sync --group playground; fix max_duration 8h→6h;
add XLA_PYTHON_CLIENT_PREALLOCATE=false
- Update remote.py: --playground passes PLAYGROUND=true env var (no separate config)
- Polish BENCHMARKS.md Phase 5: clean header, Install line, PPO+SAC columns,
remove per-env Spec column, add Phase 5 progress row
- Update CLAUDE.md: JSON→YAML spec references, max_concurrent_runs note
…ict syntax)
…needs CPU obs) PPO uses on-policy rollouts that don't benefit from GPU-native observations. device:cuda is only needed for SAC's off-policy replay buffer.
…louts) OnPolicyBatchReplay.sample() returns raw lists; when the env uses device=cuda, the list contains CUDA tensors. This previously hit the np.array() path, which fails on CUDA tensors. Fix: detect a list of tensors early, torch.stack() it, and send it to the device. Restore device:cuda to ppo_playground_arc.yaml.
- device defaults to auto: resolves to cuda if available, else cpu
- device:cuda no longer needed in specs; matches the net gpu:auto pattern
- remove device:cuda from ppo/sac playground arc specs
- log device resolution: 'Playground device: GPU/CPU' and 'JAX→PyTorch via DLPack/numpy'
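The auto default can be sketched in a few lines (a hypothetical helper mirroring the behavior described, not the spec code itself):

```python
import torch

def resolve_device(device: str = "auto") -> str:
    # "auto" resolves to cuda when available, else cpu,
    # mirroring the net gpu:auto pattern
    if device == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device

assert resolve_device("cpu") == "cpu"
assert resolve_device("auto") in ("cuda", "cpu")
```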
When the playground env returns CPU tensors (JAX on the CPU backend in a subprocess), policy_util.calc_pdparam skipped the device move for tensor inputs. Fix: always move state to net.device regardless of input type.
Updated from 213.5 (500K frames) to 709.15 (1M frames, 4 sessions).
…cs + SKILL.md plot mandate
…Control envs
Data-rich SAC exploits near-free JAX/MJX simulation by decoupling data collection from training: 256/512 envs with freq=16/32 env-steps between training calls. UTD≈0.001, in line with the literature (Raffin 2025: RR≈0.03 optimal for 1024 parallel envs).
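The UTD figure can be sanity-checked with back-of-envelope arithmetic. The updates-per-call count below is an assumption for illustration, not taken from the spec:

```python
def utd(updates_per_call: int, num_envs: int, train_freq: int) -> float:
    # updates-to-data ratio: gradient updates per environment transition;
    # each training call follows num_envs * train_freq new transitions
    transitions_per_call = num_envs * train_freq
    return updates_per_call / transitions_per_call

# assuming 4 gradient updates per training call (illustrative):
# 4 / (256 * 16) = 4 / 4096 ≈ 0.001, matching the commit's figure
assert abs(utd(4, 256, 16) - 0.001) < 1e-4
```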
- .dstack/run-gpu-train.yml: install jax[cuda12] after playground group sync
so JAX uses GPU acceleration (was silently CPU-only without this)
- env/playground.py: DLPack transfer handles CPU→GPU case with explicit .to()
- env/__init__.py: check both JAX+PyTorch GPU, warn if JAX is CPU-only
- sac_playground_arc_datarich.yaml: add DR128 (UTD=0.008, Raffin-style balanced)
alongside existing DR256/DR512; use ${max_frame} for configurable frame budget
… regression confirmed
Key changes to close the gap with the official mujoco_playground PPO:
- time_horizon: 30→480 (16 unrolls × 30, matching Brax batch collection)
- minibatch_size: 4096→30720 (32 minibatches, matching Brax)
- clip_eps: 0.2→0.3 (matching the Brax DM Control default)
- clip_grad_val: 1.0→null (no grad clipping for DM Control)
- LR schedule: linear decay→constant (min_factor=1.0)
- Value network: 3→5 layers of 256 (matching Brax)
- reward_scale: 10.0 (matching Brax, applied in the env wrapper)
- log_std max clamp: 0.5→2.0 (matching the Brax softplus range)
- The loco spec retains clip_grad_val=1.0 (the official loco config uses it)
… backend selection
- Monkey-patch mjx_env.make_data to set naccdmax=naconmax when missing, preventing CCD buffer overflow for locomotion/manipulation envs
- Guard the impl='warp' override with a hasattr check; AeroCube lacks an impl field in its locked config, causing a KeyError on load
- Move reward_scale from PlaygroundVecEnv to a VectorTransformReward wrapper after RecordEpisodeStatistics so metrics track raw rewards
PyPI playground 0.1.0 uses the old nconmax API, but the runtime mujoco-mjx uses the newer naconmax/naccdmax split, causing CCD buffer overflow for locomotion and manipulation envs with mesh/convex colliders. Install from git HEAD, which has the correct API matching the monkey-patch in playground.py.
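The monkey-patch pattern referenced above (injecting a missing kwarg into a library call) has roughly this shape. A toy module stands in for `mjx_env`, and the signature is invented, so treat this as the pattern only, not the actual patch:

```python
import functools
import types

# toy stand-in for the library module being patched
lib = types.SimpleNamespace()
def make_data(model, naconmax=None, naccdmax=None):
    return {"naconmax": naconmax, "naccdmax": naccdmax}
lib.make_data = make_data

_orig = lib.make_data

@functools.wraps(_orig)
def _patched(model, **kwargs):
    # default the CCD buffer size to the contact buffer size when unset,
    # mirroring the naccdmax=naconmax fix described above
    if kwargs.get("naccdmax") is None and kwargs.get("naconmax") is not None:
        kwargs["naccdmax"] = kwargs["naconmax"]
    return _orig(model, **kwargs)

lib.make_data = _patched

out = lib.make_data(None, naconmax=512)
assert out["naccdmax"] == 512  # CCD buffer defaulted to the contact buffer size
```

`functools.wraps` keeps the patched function's name and docstring intact so downstream introspection is unaffected.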
…s envs
Two fixes in playground.py:
1. Suppress C-level MuJoCo warnings (ccd_iterations, nefc/broadphase overflow) that repeat every step, exploding dstack log/output size over 100M frames.
2. For dict-obs envs (Go1, G1, T1, Leap, Aero), pass only the "state" key to the actor instead of concatenating privileged_state+state. Fixes an incorrect obs contract where the actor received ground-truth data it shouldn't see.
FishSwim (89%), PointMass (96%), ReacherHard (99.7%), WalkerStand (95%), WalkerWalk (98%), SpotGetup (97%), SpotJoystickGaitTracking (97%), AlohaHandOver (73%): all close enough to their targets, not worth more compute.
- pavlovian.py: 10 TC tasks, 18-dim obs, 2-DOF action, two-phase protocol
- eval.py: run_eval, Clopper-Pearson CI, threshold checking
- gates.py: Checkpoint A/B/D, DINOv2 probe gate
- 11 YAML configs (base + TC-01 to TC-10)
- 169 tests pass (132 pavlovian + 37 eval)
- sensorimotor.py: 7-DOF arm + gripper, 20 objects, PD control, Dict obs
- sensorimotor_tasks.py: 14 task definitions with reward/score/termination
- 15 YAML configs (base + TC-11 to TC-24)
- gates.py CHECKPOINT_B key fix
- 501 tests pass (332 sensorimotor + 132 pavlovian + 37 eval)
JAX JIT dispatch is async: _jit_step returns immediately, but warnings print when the GPU kernel actually executes. Must block before restoring stderr, otherwise 2.6M+ warnings leak through.
- perception.py: ProprioceptionEncoder (6 groups → 512), ObjectStateEncoder
- being_embedding.py: ChannelAttention, HierarchicalFusion, ThrownessEncoder, TemporalAttention, BeingEmbedding (→ 512-dim)
- dasein_net.py: DaseinNet integrating L0→L1→policy/value heads for PPO
- dasein_sensorimotor.yaml: config for TC-11
- 559 tests pass (L0 27 + L1 61 + DaseinNet 30 + envs 332+132 + eval 37)
SpotFlatTerrainJoystick: 11.71 → 45.75 (target 30): passed
AlohaSinglePegInsertion: 188.03 → 216.36 (target 300): improved
Go1Handstand: 6.48 → 17.88 (target 15): obs fix + loco_precise
H1InplaceGaitTracking: 4.10 → 5.54 (target 10): loco_go1 wider net
93% of the target of 30: close enough. The loco_precise spec (clip=0.2, entropy=0.005) significantly improved over the default loco spec.
Both H1 envs now pass their targets with the loco_precise spec:
- H1InplaceGaitTracking: 5.54 → 11.95 (target 10)
- H1JoystickGaitTracking: 27.83 → 31.11 (target 30)
…ly instead
block_until_ready in step() forced synchronous JAX execution, killing performance for slow-FPS envs (Humanoid: 114→10 fps, a ~10x regression). Instead, suppress stderr permanently after the first step: MuJoCo C warnings are silenced without any per-call overhead or sync barriers.
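An fd-level redirect like the one described silences C-level writes that reassigning Python's sys.stderr cannot catch, because the C runtime writes straight to file descriptor 2. A sketch under that assumption, not the playground.py code:

```python
import os
import sys

def silence_stderr_fd(target: str = os.devnull) -> int:
    """Point file descriptor 2 at `target`, silencing C-level warnings.
    Returns a dup of the original fd so the caller can restore it."""
    sys.stderr.flush()
    saved = os.dup(2)                                  # keep the real stderr
    fd = os.open(target, os.O_WRONLY | os.O_CREAT, 0o644)
    os.dup2(fd, 2)                                     # fd 2 now points at target
    os.close(fd)
    return saved

def restore_stderr_fd(saved: int) -> None:
    os.dup2(saved, 2)                                  # fd 2 back to the terminal
    os.close(saved)
```

Because the redirect happens at the descriptor level, it covers MuJoCo's C warnings and any other native-code output, with zero per-step cost once applied.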
- emotion.py: InteroceptionModule, EmotionModule (3 types), IntrinsicMotivation, MoodVector, FrustrationAccumulator (L3 Phase 3.2a subset)
- emotion_replay.py: EmotionTaggedReplayBuffer (1M, PER α=0.6, stage-aware)
- curriculum.py: CurriculumSequencer TC-01→24, mastery detection, gate integration
- 753 Phase 3 tests pass (59 emotion + 28 replay + 47 curriculum + existing)
…fixes
- 10 integration tests (end-to-end pipeline: env→L0→L1→PPO→emotion→replay→curriculum)
- Dead code removed (pavlovian, sensorimotor)
- Magic numbers → named constants (dasein_net D_MODEL, sensorimotor MAX_EPISODE_STEPS)
- NoveltyReward: nn.Module → plain class (no parameters)
- Logger shadowing fixed (eval.py, gates.py)
- L0Output dataclass consolidated (removed duplicate)
- 763 tests pass
…nfoNCE
- vision.py: DINOv2Backbone (LoRA rank 16), multi-scale features, StereoFusionModule (QK-Norm), dual-rate cache
- film.py: FiLMLayer (identity init γ=1, β=0), MoodFiLM, EmotionFiLM, SomaticMarkerSystem
- dasein_net.py: vision mode integration, InfoNCE regrounding loss (α=0.1)
- sensorimotor.py: stereo camera rendering option
- 856 tests pass
CartpoleSwingup: 675.98 → 729.09 (target 800, 91%): close enough
AlohaSinglePegInsertion: 222.49 → 223.26 (target 300): marginal
Updated 9 HF links to match the actual runs that produced the current best scores. Regenerated all plots from the latest HF data. Added missing plots for Go1Getup, BerkeleyHumanoidJoystickRoughTerrain, PandaPickCubeCartesian. Ensures reproducibility: every ✅/⚠️ row now has a matching score, HF link, and plot from the same run.
…bles
Audit and cleanup:
- Fixed score/HF link mismatches (CartpoleSwingup, FingerTurnHard, AlohaPeg)
- Regenerated all Phase 5 plots from the correct HF data
- Added G1JoystickRoughTerrain (-2.75) and AeroCubeRotateZAxis (-3.09) results
- Reverted loco specs to num_envs=2048 for reproducibility
- Removed CrossQ/SAC placeholder rows (PPO only)
- Simplified tables: ENV | MA | SPEC_NAME | HF Data
- Added sub-phase descriptions
- Reverted accidental Phase 1-4 plot changes
- Removed ❌ emoji from all tables
All 54 Phase 5 data folders uploaded to the public SLM-Lab/benchmark. Updated all HF links from benchmark-dev to benchmark. Docs and plots uploaded to public HF.
- vision.py: rename ambiguous var `l` → `layer`
- sensorimotor_tasks.py: rename ambiguous var `l` → `long`
- test_film.py: move the F import to the top of the file
- Add scipy as a dev dependency (needed by test_curriculum, test_eval, test_integration)
Replaced by a clean Phase 5-only PR.
Summary
All Phase 5 data moved from `benchmark-dev` to the public `SLM-Lab/benchmark`.
Code changes
- slm_lab/env/playground.py: suppress MuJoCo C-level stderr warnings (permanent fd redirect after first step); fix dict-obs handling to pass only the "state" key for asymmetric-obs envs (Go1, G1, T1)
- slm_lab/spec/benchmark_arc/ppo/ppo_playground.yaml: reverted loco specs to num_envs=2048 for reproducibility; added ppo_playground_loco_precise, ppo_playground_loco_go1, ppo_playground_manip_aloha_peg, ppo_playground_manip_dexterous spec variants
- docs/BENCHMARKS.md: Phase 5 tables cleaned up (PPO only, simplified columns), sub-phase descriptions added, all HF links verified against the public benchmark
Key results
ppo_playground_loco_precise (clip=0.2, entropy=0.005) was the breakthrough spec for locomotion.
Test plan
- `uv run python3 -c "from slm_lab.env.playground import PlaygroundVecEnv; print('OK')"` passes
- Phase 1-4 plots unchanged (`git diff master -- docs/plots/CartPole-v1*`)
🤖 Generated with Claude Code