feat: CrossQ algorithm — SAC without target networks #533
Merged
Conversation
Improve core framework components across multiple modules:
- PPO: normalize_advs option, fix advantage computation
- Nets: scale_obs support, split_minibatch copy fix, TorchArcNet improvements
- Memory: replay buffer performance, uint8 image storage, batch sampling fixes
- Plotting: multi-trial graph improvements, data loading fixes
- Env: action rescaling wrapper, observation normalization
- Utils: math utilities, ml_util enhancements, logging improvements
- Tests: normalizer tests, policy util tests, PPO feature tests, profiling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement CrossQ (Bhatt et al., ICLR 2024) as SAC subclass:
- Cross-batch normalization: concatenate current/next states through shared critic
- Batch Renormalization (Ioffe 2017) for off-policy RL stability
- WeightNorm linear layer for actor normalization experiments
- SAC extensions: alpha_lr (decoupled entropy LR), fixed_alpha, policy_delay
- Entropy tuning: log_alpha clamping, SD-SAC entropy penalty
- Numerically stable log-prob squashing for continuous actions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
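The cross-batch normalization idea can be sketched in a few lines of PyTorch. This is an illustrative sketch, not SLM-Lab's actual class: layer widths, plain `BatchNorm1d` standing in for Batch Renormalization, and all names are assumptions.

```python
import torch
import torch.nn as nn

class CrossBatchCritic(nn.Module):
    """Sketch of CrossQ's joint forward pass: current and next
    state-action pairs go through one shared BatchNorm forward, so the
    normalization statistics cover both distributions and no target
    network is needed. Plain BatchNorm1d is used here for brevity where
    the paper uses Batch Renormalization."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next, a_next):
        # Concatenate along the batch dim so BN sees both halves at
        # once, then split the Q-values back out.
        sa = torch.cat([torch.cat([s, a], dim=-1),
                        torch.cat([s_next, a_next], dim=-1)], dim=0)
        q = self.net(sa)
        q_cur, q_next = torch.chunk(q, 2, dim=0)
        return q_cur, q_next
```

A single forward over the concatenated batch is what lets the critic's BN layers stay consistent between the TD target and the prediction.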
Add CrossQ specs covering Classic Control, Box2D, MuJoCo, and Atari:
- Classic/Box2D: BRN critics [256,256], lr=3e-4, UTD=1
- MuJoCo: critic width scales with env difficulty (256/512/1024)
- Actor LayerNorm for most envs, iter=2 for Humanoid/HumanoidStandup
- Atari: iter=1, FC1024 BRN critics, alpha_lr=3e-5
- Experimental specs for roadmap feature testing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive CrossQ benchmark documentation:
- BENCHMARKS.md: CrossQ results for Classic, Box2D, MuJoCo (22 envs), Atari (6 games)
- Updated plots for all benchmarked environments (CrossQ vs SAC/PPO overlays)
- CROSSQ_TRACKER.md: detailed experiment log with v1-v14 progression
- IMPROVEMENTS_ROADMAP.md: feature roadmap and priorities
- CHANGELOG.md: version history updates
- Benchmark skill instructions for operational workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dstack 0.20.x `files` mount no longer creates /workflow — use default repo clone instead. Remove `files: - ..:/workflow` and `cd /workflow &&` from all configs. Broaden GPU filter to memory: 20GB.. for capacity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean canonical spec names with best configs from iterative experiments:
- LN actor + extended frames for hard envs (HalfCheetah, Walker2d, Ant)
- iter=2 + [1024] critics for very hard envs (Humanoid, HumanoidStandup, InvDblPend)
- [512] critic upgrades for medium envs (Hopper, InvPend)
- Remove all suffixed experiment specs (_ln, _v2, _i2, _7m, _8m, _wn)
- Update BENCHMARKS.md spec names to match clean canonical names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pendulum: -163.52 → -145.66 (improved, beats SAC -150.97)
- LunarLander: 136.25 → 139.21 (marginal improvement)
- Regenerate plots for all classic/box2d envs with latest data
- Update CROSSQ_TRACKER with Wave 9 clean spec relaunch status
- Fix dstack max_duration to 8h for longer MuJoCo runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hopper +8% (1403), Walker +3% (4390), InvPend +4% (878). Clean spec reruns with upgraded configs (LN actors, wider critics). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InvDblPend [1024] critics too wide for 9-dim state — old v2 [512] scored 8255. Humanoid needs more training time to reduce variance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CrossQ should demonstrate sample efficiency at or below SAC frame budgets. InvDblPend: [1024]→[512] critics (too wide for 9-dim state). All envs capped: HalfCheetah 4M, Hopper/Walker 3M, Ant/InvDblPend/InvPend/Swimmer 2M, Humanoid/HumanoidStandup/Reacher/Pusher 1M. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… Swimmer 5M
Ant needs 3M to plateau at ~5000. Humanoid solves >1000 at 2M. Swimmer is a slow learner and needs 5M to reliably hit >200. All runs well under 4h wall time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed duplicate CrossQ/SAC lines from stale data folders. Hopper, Walker, InvPend, HumanoidStandup, Reacher now show only benchmark-linked runs (PPO, SAC, CrossQ). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ots from reruns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CartPole: 10% of the 200K budget is lost to BRN warmup. 300K gives 50% more training (still trivial wall time). 4 attempts averaged 360, target 400.
Humanoid: iter=2 at 2M = 125K gradients (8x less than SAC iter=16). iter=4 gives 500K gradients at ~200fps (~2.8h). Humanoid needs high UTD per prior experiments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
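The gradient-budget arithmetic above reduces to one formula; the training frequency of 32 frames per update batch is an assumption inferred from the 125K figure, not stated in the commit:

```python
def gradient_steps(max_frame, training_frequency, training_iter):
    # Total optimizer updates over a run: `training_iter` updates
    # every `training_frequency` environment frames.
    return max_frame // training_frequency * training_iter

# Humanoid at 2M frames, iter=2 (assumed training_frequency=32):
print(gradient_steps(2_000_000, 32, 2))   # → 125000
# SAC at iter=16 under the same assumptions does 8x more updates.
print(gradient_steps(2_000_000, 32, 16))  # → 1000000
```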
…dients
Respect env settings max_frame=2e5. Instead of extending frames, double the gradient steps (50K→100K) via iter=2 within the same budget. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverted all experimental changes (training_iter=4, warmup=2000, log_alpha_max=1.0, training_start_step=5000) back to original arc settings: iter=1, warmup=5000, log_alpha_max=2.0, start_step=1000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Last attempt — double gradients within same 200K frame budget. Reverted warmup_steps=5000 from arc spec, only change is iter=2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…v12, iter=2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hopper env setting is max_frame 4e6. Previous CrossQ entry used a 6M frame run. Replaced with 3M run (MA 1168 vs 1403) that respects the env settings limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MsPacman v12 and Pong v14 were experimental runs not on public HF. Replaced with standard-named runs that exist on SLM-Lab/benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 of 11 CrossQ MuJoCo entries had stale max_frame values from earlier experimental runs. Updated to match actual t0_spec.json frames. Flagged InvertedPendulum (7M exceeds 4M env settings). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cibility
Update 6 max_frame values in crossq_mujoco.yaml to match the runs that produced benchmark scores. Remove -s override instructions from docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nibatch
From feat/optimization worktree:
- Pinned memory transfers for faster GPU batch prep (ml_util.py)
- torch.profiler integration with --profile flag (torch_profiler.py, control.py)
- PPO MuJoCo minibatch_size 64→256 (verified +17-39% scores)
- PPO Atari max_frame as ${variable} for flexible runs
- BatchRenorm register_buffer fix for proper state_dict serialization
- Remove dead CrossQ calc_v_next method
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
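The BatchRenorm register_buffer fix matters because plain tensor attributes are invisible to `state_dict`. A minimal illustration of the principle, with a hypothetical class name rather than SLM-Lab's actual module:

```python
import torch
import torch.nn as nn

class BatchRenormSketch(nn.Module):
    """Illustrative fragment only: running statistics stored as plain
    attributes (e.g. self.running_mean = torch.zeros(n)) would not
    appear in state_dict() and so would be lost on save/load.
    register_buffer makes them serialize and move with the module."""

    def __init__(self, num_features):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

sketch = BatchRenormSketch(4)
# Buffers show up in state_dict, unlike plain attributes.
print(sorted(sketch.state_dict().keys()))  # → ['running_mean', 'running_var']
```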
…in specs
YAML doesn't recognize bare scientific notation like "4e6" — only "4.0e+6".
Convert all numeric -s values to canonical integer/float form before
substitution. Also fail fast with clear error when ${var} placeholders
remain unsubstituted, preventing silent runtime bugs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
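The conversion and the fail-fast check described above can be sketched as follows; the helper names are hypothetical and SLM-Lab's actual substitution code may differ:

```python
import re

_PLACEHOLDER = re.compile(r"\$\{[^}]+\}")

def normalize_num(s):
    """Render '4e6' as '4000000' so a YAML 1.1 loader (which only
    resolves floats written like '4.0e+6') parses it as a number
    rather than a string."""
    v = float(s)
    return str(int(v)) if v.is_integer() else repr(v)

def check_substituted(spec_text):
    """Fail fast if any ${var} placeholder survived substitution,
    instead of letting it surface as a silent runtime bug."""
    leftovers = _PLACEHOLDER.findall(spec_text)
    if leftovers:
        raise ValueError(f"unsubstituted spec variables: {leftovers}")
    return spec_text

print(normalize_num("4e6"))   # → 4000000
print(normalize_num("3e-4"))  # → 0.0003
```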
Revert spec changes from optimization worktree that aren't core
optimizations. Minibatch 256 caused -50% PPO HalfCheetah regression.
Atari ${max_frame} variable added unnecessary complexity.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7/8 verification runs scored lower than baselines. pin_memory() made non_blocking=True genuinely async (was silently sync before). Reverting to isolate whether async transfers cause the regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
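The pattern being isolated can be sketched as below; the helper name is hypothetical. The key behavior is that `non_blocking=True` only overlaps the host-to-device copy with compute when the source tensor sits in pinned (page-locked) memory; from pageable memory the same call silently runs synchronously.

```python
import torch

def to_device(batch, device):
    # pin_memory() page-locks the host tensor so the subsequent
    # .to(..., non_blocking=True) copy can be genuinely asynchronous.
    if device.type == "cuda":
        return batch.pin_memory().to(device, non_blocking=True)
    # CPU fallback: no pinning, plain synchronous move.
    return batch.to(device)
```

Because the async copy returns before the data lands, any correctness issue would come from downstream ops reading the tensor on a different stream, which is why reverting it is a clean way to isolate the regression.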
- Add Supervisor-first trait to Role & Mindset
- Add explanatory sentences to each code design principle
- Rewrite Agent Teams section: TeamCreate over subagents, panel of agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CrossQ CartPole: 324.10→334.59 (was using strength, not total_reward_ma)
- CrossQ Humanoid: 1102.00→1755.29 (same strength vs total_reward_ma issue)
- CrossQ Swimmer: fix HF link to correct run folder (_184204 not _134711)
- CrossQ Acrobot: mark as warning (score -103.13 does not meet target >-100)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🎉 This PR is included in version 5.2.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Summary
CrossQ benchmark results
Classic Control & Box2D
MuJoCo
MuJoCo wall-clock speedup (CrossQ vs SAC)
Atari (58 games)
CrossQ Atari uses iter=1 with FC1024 critics + slow alpha_lr. Runs at ~320fps (2.5x faster than SAC). CrossQ generally underperforms SAC/PPO on Atari due to cross-batch BN with correlated frames. Full results in docs/BENCHMARKS.md.

Test plan

`uv run pytest` passes

🤖 Generated with Claude Code