feat: CrossQ algorithm — SAC without target networks by kengz · Pull Request #533 · kengz/SLM-Lab

kengz · 2026-03-03T16:52:35Z

Summary

- CrossQ algorithm (Bhatt et al., ICLR 2024): SAC variant that eliminates target networks via cross-batch normalization in critics. UTD=1 with wider critics gives comparable performance at 1.5-3.5x faster wall-clock time.
- Batch Renormalization (Ioffe 2017): Fixes BatchNorm instability in non-stationary RL. Clamped correction factors with warmup schedule.
- Full benchmark suite: 238 entries across Classic Control, Box2D, MuJoCo, and Atari (58 games) for PPO, SAC, A2C, DQN, and CrossQ.
- Framework improvements: TorchArc YAML networks, spec variable substitution fixes, torch profiler integration, plotting enhancements.
- Actor normalization experiments: LayerNorm + extended frames recipe for CrossQ MuJoCo (beats SAC on Ant, Walker, HumanoidStandup, HalfCheetah).
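
To make the Batch Renormalization item concrete, here is a minimal sketch of the clamped correction factors r and d from Ioffe (2017), applied to a single feature column. The linear warmup, the final limits `r_max=3` and `d_max=5` (the values suggested in the paper), and all function names are assumptions for illustration; the schedule and limits actually used in this PR are not shown on this page.

```python
import math
from statistics import fmean, pvariance


def clamp(value, low, high):
    return max(low, min(high, value))


def brn_limits(step, warmup_steps, r_max_final=3.0, d_max_final=5.0):
    """Relax the clamps linearly from plain BatchNorm (r_max=1, d_max=0).

    The linear ramp and the final limits are assumptions, not the PR's values.
    """
    frac = 1.0 if warmup_steps <= 0 else clamp(step / warmup_steps, 0.0, 1.0)
    return 1.0 + frac * (r_max_final - 1.0), frac * d_max_final


def batch_renorm(batch, running_mean, running_std, step, warmup_steps, eps=1e-5):
    """Normalize one feature column with Batch Renormalization (training mode)."""
    mu_b = fmean(batch)
    sigma_b = math.sqrt(pvariance(batch, mu_b) + eps)
    r_max, d_max = brn_limits(step, warmup_steps)
    # r and d pull the batch statistics toward the running statistics;
    # in a real network both are treated as constants during backprop.
    r = clamp(sigma_b / running_std, 1.0 / r_max, r_max)
    d = clamp((mu_b - running_mean) / running_std, -d_max, d_max)
    return [(x - mu_b) / sigma_b * r + d for x in batch]
```

At step 0 the clamps force r=1 and d=0, so the layer behaves like ordinary BatchNorm. Once the clamps are relaxed and not binding, the output equals normalization by the running statistics, which is what makes training and inference agree.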

CrossQ benchmark results

Classic Control & Box2D

Env	Status	Score	Target
CartPole-v1	⚠️	334.59	>400
Acrobot-v1	⚠️	-103.13	>-100
Pendulum-v1	✅	-145.66	>-200
LunarLander-v3	❌	139.21	>200
LunarLanderContinuous-v3	✅	268.91	>200

MuJoCo

Env	Status	Score	SAC Score	Target
Ant-v5	✅	4517	4844	>2000
HalfCheetah-v5	✅	8617	9815	>5000
Hopper-v5	⚠️	1169	1510	>1500
Humanoid-v5	✅	1755	1990	>1000
HumanoidStandup-v5	✅	150913	138222	>100K
InvDoublePendulum-v5	✅	8027	9268	>8000
InvPendulum-v5	⚠️	878	1000	>950
Pusher-v5	✅	-37.08	-42.07	>-50
Reacher-v5	✅	-5.65	-4.72	>-7
Swimmer-v5	✅	221	301	>100
Walker2d-v5	✅	4390	3900	>2000

MuJoCo wall-clock speedup (CrossQ vs SAC)

Env	CrossQ FPS	SAC FPS	Speedup
HalfCheetah	705	200	3.5x
Hopper	693	104	6.7x
Walker2d	~700	104	6.7x
Ant	~700	200	3.5x
Humanoid	~350	53	6.6x
HumanoidStandup	340	53	6.4x
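
The Speedup column appears to be the ratio of CrossQ FPS to SAC FPS, rounded to one decimal. The short sketch below recomputes it for the rows that report exact FPS values; the dictionary and function names are hypothetical.

```python
# (CrossQ FPS, SAC FPS) for the rows with exact values in the table above.
FPS = {
    "HalfCheetah": (705, 200),
    "Hopper": (693, 104),
    "HumanoidStandup": (340, 53),
}


def speedup(crossq_fps, sac_fps):
    """Wall-clock speedup as a throughput ratio, rounded like the table."""
    return round(crossq_fps / sac_fps, 1)


for env, (crossq_fps, sac_fps) in FPS.items():
    print(f"{env}: {speedup(crossq_fps, sac_fps)}x")
```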

Atari (58 games)

CrossQ Atari uses iter=1 with FC1024 critics + slow alpha_lr. Runs at ~320fps (2.5x faster than SAC). CrossQ generally underperforms SAC/PPO on Atari due to cross-batch BN with correlated frames. Full results in docs/BENCHMARKS.md.

Test plan

- uv run pytest passes
- 8 GPU verification runs across CrossQ/SAC/PPO x Classic/MuJoCo/Atari: no score regressions
- All 238 benchmark entries audited: HF links valid, plots present, scores match trial_metrics
- Spec files in repo match specs used for HF benchmark runs

🤖 Generated with Claude Code

Improve core framework components across multiple modules:
- PPO: normalize_advs option, fix advantage computation
- Nets: scale_obs support, split_minibatch copy fix, TorchArcNet improvements
- Memory: replay buffer performance, uint8 image storage, batch sampling fixes
- Plotting: multi-trial graph improvements, data loading fixes
- Env: action rescaling wrapper, observation normalization
- Utils: math utilities, ml_util enhancements, logging improvements
- Tests: normalizer tests, policy util tests, PPO feature tests, profiling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement CrossQ (Bhatt et al., ICLR 2024) as SAC subclass:
- Cross-batch normalization: concatenate current/next states through shared critic
- Batch Renormalization (Ioffe 2017) for off-policy RL stability
- WeightNorm linear layer for actor normalization experiments
- SAC extensions: alpha_lr (decoupled entropy LR), fixed_alpha, policy_delay
- Entropy tuning: log_alpha clamping, SD-SAC entropy penalty
- Numerically stable log-prob squashing for continuous actions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
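
The cross-batch normalization step in this commit can be pictured with a toy example: current and next inputs are concatenated and pushed through the same normalizing layer, so the batch statistics are computed over both halves at once. This is a sketch in plain Python, not the SLM-Lab implementation; `batch_norm` and `cross_batch_forward` are hypothetical names, and a single scalar feature stands in for the critic's activations.

```python
import math
from statistics import fmean, pvariance


def batch_norm(column, eps=1e-5):
    """Plain training-mode BatchNorm over one feature column."""
    mu = fmean(column)
    sigma = math.sqrt(pvariance(column, mu) + eps)
    return [(x - mu) / sigma for x in column]


def cross_batch_forward(layer, current, nxt):
    """Run current and next inputs through `layer` as one joint batch."""
    joint = current + nxt                 # concatenate along the batch axis
    out = layer(joint)                    # statistics come from both halves
    return out[:len(current)], out[len(current):]


current = [0.0, 1.0, 2.0]    # stand-in for features of (s, a)
nxt = [10.0, 11.0, 12.0]     # stand-in for features of (s', a')
q_cur, q_next = cross_batch_forward(batch_norm, current, nxt)
# In a real critic, the next-state half would then be detached and used
# as the bootstrap target instead of a separate target network's output.
```

Normalizing each half separately would center both at zero and erase the difference between the two distributions; the joint pass keeps the current half below zero and the next half above it.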

Add CrossQ specs covering Classic Control, Box2D, MuJoCo, and Atari:
- Classic/Box2D: BRN critics [256,256], lr=3e-4, UTD=1
- MuJoCo: critic width scales with env difficulty (256/512/1024)
- Actor LayerNorm for most envs, iter=2 for Humanoid/HumanoidStandup
- Atari: iter=1, FC1024 BRN critics, alpha_lr=3e-5
- Experimental specs for roadmap feature testing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add comprehensive CrossQ benchmark documentation:
- BENCHMARKS.md: CrossQ results for Classic, Box2D, MuJoCo (22 envs), Atari (6 games)
- Updated plots for all benchmarked environments (CrossQ vs SAC/PPO overlays)
- CROSSQ_TRACKER.md: detailed experiment log with v1-v14 progression
- IMPROVEMENTS_ROADMAP.md: feature roadmap and priorities
- CHANGELOG.md: version history updates
- Benchmark skill instructions for operational workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dstack 0.20.x `files` mount no longer creates /workflow — use default repo clone instead. Remove `files: - ..:/workflow` and `cd /workflow &&` from all configs. Broaden GPU filter to memory: 20GB.. for capacity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clean canonical spec names with best configs from iterative experiments:
- LN actor + extended frames for hard envs (HalfCheetah, Walker2d, Ant)
- iter=2 + [1024] critics for very hard envs (Humanoid, HumanoidStandup, InvDblPend)
- [512] critic upgrades for medium envs (Hopper, InvPend)
- Remove all suffixed experiment specs (_ln, _v2, _i2, _7m, _8m, _wn)
- Update BENCHMARKS.md spec names to match clean canonical names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Pendulum: -163.52 → -145.66 (improved, beats SAC -150.97)
- LunarLander: 136.25 → 139.21 (marginal improvement)
- Regenerate plots for all classic/box2d envs with latest data
- Update CROSSQ_TRACKER with Wave 9 clean spec relaunch status
- Fix dstack max_duration to 8h for longer MuJoCo runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CrossQ should demonstrate sample efficiency at or below SAC frame budgets. InvDblPend: [1024]→[512] critics (too wide for 9-dim state). All envs capped: HalfCheetah 4M, Hopper/Walker 3M, Ant/InvDblPend/InvPend/Swimmer 2M, Humanoid/HumanoidStandup/Reacher/Pusher 1M. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… Swimmer 5M Ant needs 3M to plateau at ~5000. Humanoid solves >1000 at 2M. Swimmer is slow learner, needs 5M to reliably hit >200. All runs well under 4h wall time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CartPole: 10% of 200K budget lost to BRN warmup. 300K gives 50% more training (still trivial wall time). 4 attempts averaged 360, target 400. Humanoid: iter=2 at 2M = 125K gradients (8x less than SAC iter=16). iter=4 gives 500K gradients at ~200fps (~2.8h). Humanoid needs high UTD per prior experiments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reverted all experimental changes (training_iter=4, warmup=2000, log_alpha_max=1.0, training_start_step=5000) back to original arc settings: iter=1, warmup=5000, log_alpha_max=2.0, start_step=1000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
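
For context on the `log_alpha_max` values in this revert, the sketch below shows how an upper clamp on `log_alpha` caps SAC's entropy temperature, alpha = exp(log_alpha). It is an illustration only: `effective_alpha` is a hypothetical helper, and this page does not say whether the implementation clamps the parameter in place or also applies a lower bound.

```python
import math


def effective_alpha(log_alpha, log_alpha_max):
    """Entropy temperature after clamping log_alpha from above."""
    return math.exp(min(log_alpha, log_alpha_max))


# A runaway log_alpha of 5.0 is capped at exp(log_alpha_max).
print(effective_alpha(5.0, log_alpha_max=1.0))  # cap tried in the experiment
print(effective_alpha(5.0, log_alpha_max=2.0))  # cap restored by the revert
```

Lowering the cap from 2.0 to 1.0 shrinks the largest possible temperature from about 7.39 to about 2.72, which limits how strongly the entropy bonus can dominate the reward.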

9 of 11 CrossQ MuJoCo entries had stale max_frame values from earlier experimental runs. Updated to match actual t0_spec.json frames. Flagged InvertedPendulum (7M exceeds 4M env settings). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nibatch

From feat/optimization worktree:
- Pinned memory transfers for faster GPU batch prep (ml_util.py)
- torch.profiler integration with --profile flag (torch_profiler.py, control.py)
- PPO MuJoCo minibatch_size 64→256 (verified +17-39% scores)
- PPO Atari max_frame as ${variable} for flexible runs
- BatchRenorm register_buffer fix for proper state_dict serialization
- Remove dead CrossQ calc_v_next method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…in specs YAML doesn't recognize bare scientific notation like "4e6" — only "4.0e+6". Convert all numeric -s values to canonical integer/float form before substitution. Also fail fast with clear error when ${var} placeholders remain unsubstituted, preventing silent runtime bugs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
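
A minimal sketch of the two fixes this commit describes: rewriting numeric overrides into a form YAML reads as a number, and failing fast when a `${var}` placeholder is left behind. The function names, the regular expression, and the plain string replacement are assumptions for illustration, not the code from this PR.

```python
import re

# Matches ${name} placeholders; the allowed name characters are an assumption.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def canonical_number(raw: str) -> str:
    """Rewrite a numeric string as an integer or dotted float literal."""
    try:
        value = float(raw)
    except ValueError:
        return raw                        # not numeric: leave untouched
    if value.is_integer():
        return str(int(value))            # "4e6" -> "4000000"
    text = repr(value)
    if "e" in text:                       # e.g. "3e-05" still lacks a dot
        mantissa, exponent = text.split("e")
        if "." not in mantissa:
            mantissa += ".0"
        if exponent[0] not in "+-":
            exponent = "+" + exponent
        text = f"{mantissa}e{exponent}"
    return text


def substitute(spec_text: str, overrides: dict) -> str:
    """Fill ${var} placeholders, then reject any that remain."""
    for name, raw in overrides.items():
        spec_text = spec_text.replace("${" + name + "}", canonical_number(raw))
    leftover = _PLACEHOLDER.findall(spec_text)
    if leftover:
        raise ValueError(f"unsubstituted spec variables: {sorted(set(leftover))}")
    return spec_text


print(substitute("max_frame: ${max_frame}", {"max_frame": "4e6"}))
```

Raising at substitution time surfaces a missing override immediately, rather than letting a literal `${max_frame}` string reach the training loop.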

Revert spec changes from optimization worktree that aren't core optimizations. Minibatch 256 caused -50% PPO HalfCheetah regression. Atari ${max_frame} variable added unnecessary complexity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

7/8 verification runs scored lower than baselines. pin_memory() made non_blocking=True genuinely async (was silently sync before). Reverting to isolate whether async transfers cause the regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add Supervisor-first trait to Role & Mindset
- Add explanatory sentences to each code design principle
- Rewrite Agent Teams section: TeamCreate over subagents, panel of agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- CrossQ CartPole: 324.10→334.59 (was using strength, not total_reward_ma)
- CrossQ Humanoid: 1102.00→1755.29 (same strength vs total_reward_ma issue)
- CrossQ Swimmer: fix HF link to correct run folder (_184204 not _134711)
- CrossQ Acrobot: mark as warning (score -103.13 does not meet target >-100)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-04T10:41:34Z

🎉 This PR is included in version 5.2.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

kengz and others added 30 commits February 28, 2026 18:32

docs: update CrossQ MuJoCo benchmark scores and plots

267a696

Hopper +8% (1403), Walker +3% (4390), InvPend +4% (878). Clean spec reruns with upgraded configs (LN actors, wider critics). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update CrossQ HumanoidStandup and Reacher HF links to clean-nam…

7f82b2e

…e runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: CrossQ InvDblPend critic [1024]→[512], Humanoid 3.5M→4M frames

92803f7

InvDblPend [1024] critics too wide for 9-dim state — old v2 [512] scored 8255. Humanoid needs more training time to reduce variance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: regenerate 5 MuJoCo plots with clean legend entries

cb74867

Removed duplicate CrossQ/SAC lines from stale data folders. Hopper, Walker, InvPend, HumanoidStandup, Reacher now show only benchmark-linked runs (PPO, SAC, CrossQ). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update CrossQ Ant, HalfCheetah, InvDoublePendulum scores and pl…

63eefba

…ots from reruns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update CrossQ LunarLanderContinuous score 249.85→268.91 from rerun

5cafac8

fix: CartPole revert to 200K frames, add training_iter=2 for more gra…

bda2611

…dients Respect env settings max_frame=2e5. Instead of extending frames, double gradient steps (50K→100K) via iter=2 within same budget. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
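
The 50K→100K figure follows from the frame budget if one update round runs every 4 frames. That ratio is inferred here from 200K frames yielding 50K steps at iter=1; it is not stated in the PR, and `gradient_steps` is a hypothetical helper.

```python
def gradient_steps(max_frame, frames_per_update, training_iter):
    """Total gradient steps for a fixed frame budget."""
    return (max_frame // frames_per_update) * training_iter


# frames_per_update=4 is an inferred assumption, see the note above.
print(gradient_steps(200_000, frames_per_update=4, training_iter=1))
print(gradient_steps(200_000, frames_per_update=4, training_iter=2))
```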

fix: CartPole training_iter=4, BRN warmup=2000 for more gradients

f57d843

fix: CartPole training_start_step=5000 for better initial buffer dive…

3e6102d

…rsity

docs: update CrossQ Humanoid score 1850→1102 and plot

93804b7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: CartPole training_iter=2 for moderate UTD bump

1323766

Last attempt — double gradients within same 200K frame budget. Reverted warmup_steps=5000 from arc spec, only change is iter=2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update CrossQ CartPole score 405.88→324.10 from non-arc rerun (…

a3eeb72

…v12, iter=2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: graduate CrossQ CartPole HF link to public benchmark

7c093de

docs: regenerate all CrossQ benchmark plots

ba0d861

docs: use 3M Hopper run (within env settings) instead of 6M

13a32b7

Hopper env setting is max_frame 4e6. Previous CrossQ entry used a 6M frame run. Replaced with 3M run (MA 1168 vs 1403) that respects the env settings limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: fix CrossQ Atari HF links to standard-named folders on public repo

61b5c97

MsPacman v12 and Pong v14 were experimental runs not on public HF. Replaced with standard-named runs that exist on SLM-Lab/benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: align CrossQ MuJoCo specs with actual benchmark runs for reprodu…

2cf11b0

…cibility Update 6 max_frame values in crossq_mujoco.yaml to match the runs that produced benchmark scores. Remove -s override instructions from docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kengz and others added 8 commits March 2, 2026 20:35

feat: bump version to 5.2.0 — CrossQ algorithm

6e0f68b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove CrossQ tracker doc (superseded by BENCHMARKS.md)

1b62a17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove improvements roadmap doc (work completed)

e936e75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kengz merged commit b9139ef into master Mar 4, 2026
3 checks passed

kengz deleted the feat/improvements-roadmap branch March 4, 2026 10:41

github-actions bot added the released label Mar 4, 2026

kengz merged 38 commits into master from feat/improvements-roadmap
