feat: CrossQ algorithm — SAC without target networks #533

Merged
kengz merged 38 commits into master from feat/improvements-roadmap on Mar 4, 2026

@kengz (Owner) commented Mar 3, 2026

Summary

  • CrossQ algorithm (Bhatt et al., ICLR 2024): SAC variant that eliminates target networks via cross batch normalization in critics. UTD=1 with wider critics gives comparable performance at 1.5-3.5x faster wall-clock time
  • Batch Renormalization (Ioffe 2017): Fixes BatchNorm instability in non-stationary RL. Clamped correction factors with warmup schedule
  • Full benchmark suite: 238 entries across Classic Control, Box2D, MuJoCo, and Atari (58 games) for PPO, SAC, A2C, DQN, and CrossQ
  • Framework improvements: TorchArc YAML networks, spec variable substitution fixes, torch profiler integration, plotting enhancements
  • Actor normalization experiments: LayerNorm + extended frames recipe for CrossQ MuJoCo (beats SAC on Ant, Walker, HumanoidStandup, HalfCheetah)
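
As a rough illustration of the cross batch normalization idea in the first bullet, the sketch below normalizes the current and next critic inputs with statistics computed over their union, then splits the result back into two halves. It is a dependency-free toy in plain Python, not the code from this PR: `joint_batch_norm` and the sample data are hypothetical names and values chosen for illustration. In a PyTorch critic the same effect would typically come from concatenating the two batches with `torch.cat` and running them through a `torch.nn.BatchNorm1d` layer in training mode.

```python
import math
from typing import List, Tuple

Batch = List[List[float]]  # rows are samples, columns are features


def joint_batch_norm(current: Batch, nxt: Batch, eps: float = 1e-5) -> Tuple[Batch, Batch]:
    """Normalize two batches with mean/variance computed over their union."""
    joint = current + nxt  # concatenate along the batch axis
    n, dim = len(joint), len(joint[0])
    mean = [sum(row[j] for row in joint) / n for j in range(dim)]
    var = [sum((row[j] - mean[j]) ** 2 for row in joint) / n for j in range(dim)]
    normed = [
        [(row[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(dim)]
        for row in joint
    ]
    # Split back into the current half and the next-state half.
    return normed[: len(current)], normed[len(current):]


# Toy critic inputs: next-state features are shifted relative to current ones.
current = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
nxt = [[4.0, 5.0], [5.0, 6.0], [6.0, 7.0]]
cur_n, nxt_n = joint_batch_norm(current, nxt)
```

Because both halves share one set of statistics, the shift between current and next-state inputs survives normalization. Normalizing each half separately would erase it.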

CrossQ benchmark results

Classic Control & Box2D

| Env | Status | Score | Target |
|---|---|---|---|
| CartPole-v1 | ⚠️ | 334.59 | >400 |
| Acrobot-v1 | | -103.13 | >-100 |
| Pendulum-v1 | | -145.66 | >-200 |
| LunarLander-v3 | | 139.21 | >200 |
| LunarLanderContinuous-v3 | | 268.91 | >200 |

MuJoCo

| Env | Status | Score | SAC Score | Target |
|---|---|---|---|---|
| Ant-v5 | | 4517 | 4844 | >2000 |
| HalfCheetah-v5 | | 8617 | 9815 | >5000 |
| Hopper-v5 | ⚠️ | 1169 | 1510 | >1500 |
| Humanoid-v5 | | 1755 | 1990 | >1000 |
| HumanoidStandup-v5 | | 150913 | 138222 | >100K |
| InvDoublePendulum-v5 | | 8027 | 9268 | >8000 |
| InvPendulum-v5 | ⚠️ | 878 | 1000 | >950 |
| Pusher-v5 | | -37.08 | -42.07 | >-50 |
| Reacher-v5 | | -5.65 | -4.72 | >-7 |
| Swimmer-v5 | | 221 | 301 | >100 |
| Walker2d-v5 | | 4390 | 3900 | >2000 |

MuJoCo wall-clock speedup (CrossQ vs SAC)

| Env | CrossQ FPS | SAC FPS | Speedup |
|---|---|---|---|
| HalfCheetah | 705 | 200 | 3.5x |
| Hopper | 693 | 104 | 6.7x |
| Walker2d | ~700 | 104 | 6.7x |
| Ant | ~700 | 200 | 3.5x |
| Humanoid | ~350 | 53 | 6.6x |
| HumanoidStandup | 340 | 53 | 6.4x |
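
The Speedup column appears to be the ratio of CrossQ FPS to SAC FPS, rounded to one decimal. The short check below recomputes it from the figures in the table. The `fps` dictionary is only a transcription of those rows, with the approximate (~) entries taken at face value.

```python
# FPS figures copied from the speedup table: (CrossQ FPS, SAC FPS).
fps = {
    "HalfCheetah": (705, 200),
    "Hopper": (693, 104),
    "Walker2d": (700, 104),
    "Ant": (700, 200),
    "Humanoid": (350, 53),
    "HumanoidStandup": (340, 53),
}

# Speedup = CrossQ throughput divided by SAC throughput, to one decimal.
speedups = {env: round(crossq / sac, 1) for env, (crossq, sac) in fps.items()}

for env, ratio in speedups.items():
    print(f"{env}: {ratio}x")
```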

Atari (58 games)

CrossQ Atari uses iter=1 with FC1024 critics and a slow alpha_lr. It runs at ~320 FPS (2.5x faster than SAC). CrossQ generally underperforms SAC/PPO on Atari due to cross-batch BN with correlated frames. Full results are in docs/BENCHMARKS.md.

Test plan

  • uv run pytest passes
  • 8 GPU verification runs across CrossQ/SAC/PPO x Classic/MuJoCo/Atari — no score regressions
  • All 238 benchmark entries audited: HF links valid, plots present, scores match trial_metrics
  • Spec files in repo match specs used for HF benchmark runs

🤖 Generated with Claude Code

kengz and others added 30 commits February 28, 2026 18:32
Improve core framework components across multiple modules:
- PPO: normalize_advs option, fix advantage computation
- Nets: scale_obs support, split_minibatch copy fix, TorchArcNet improvements
- Memory: replay buffer performance, uint8 image storage, batch sampling fixes
- Plotting: multi-trial graph improvements, data loading fixes
- Env: action rescaling wrapper, observation normalization
- Utils: math utilities, ml_util enhancements, logging improvements
- Tests: normalizer tests, policy util tests, PPO feature tests, profiling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement CrossQ (Bhatt et al., ICLR 2024) as SAC subclass:
- Cross-batch normalization: concatenate current/next states through shared critic
- Batch Renormalization (Ioffe 2017) for off-policy RL stability
- WeightNorm linear layer for actor normalization experiments
- SAC extensions: alpha_lr (decoupled entropy LR), fixed_alpha, policy_delay
- Entropy tuning: log_alpha clamping, SD-SAC entropy penalty
- Numerically stable log-prob squashing for continuous actions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
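
The commit above does not show how the log-prob squashing is made stable, so the following is an assumption about what that bullet refers to: the tanh change-of-variables term `log(1 - tanh(u)^2)`, which a direct evaluation tends to turn into `log(0)` once `|u|` is large. A common rewrite uses the identity `log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u))`. The function names below are hypothetical and the sketch works on scalars with the standard library only.

```python
import math


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)) that avoids overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_log_det(u: float) -> float:
    """log(1 - tanh(u)^2), computed as 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_log_prob(u: float, mean: float, std: float) -> float:
    """Log-density of a = tanh(u), where u is drawn from Normal(mean, std)."""
    gauss = (
        -0.5 * ((u - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: subtract the log-derivative of tanh at u.
    return gauss - tanh_log_det(u)


# Moderate and extreme pre-squash values both stay finite.
print(tanh_log_det(0.5), tanh_log_det(20.0))
```

For moderate `u` this agrees with the direct formula. For large `|u|` it stays finite where the direct formula would lose all precision.
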
Add CrossQ specs covering Classic Control, Box2D, MuJoCo, and Atari:
- Classic/Box2D: BRN critics [256,256], lr=3e-4, UTD=1
- MuJoCo: critic width scales with env difficulty (256/512/1024)
  - Actor LayerNorm for most envs, iter=2 for Humanoid/HumanoidStandup
- Atari: iter=1, FC1024 BRN critics, alpha_lr=3e-5
- Experimental specs for roadmap feature testing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive CrossQ benchmark documentation:
- BENCHMARKS.md: CrossQ results for Classic, Box2D, MuJoCo (22 envs), Atari (6 games)
- Updated plots for all benchmarked environments (CrossQ vs SAC/PPO overlays)
- CROSSQ_TRACKER.md: detailed experiment log with v1-v14 progression
- IMPROVEMENTS_ROADMAP.md: feature roadmap and priorities
- CHANGELOG.md: version history updates
- Benchmark skill instructions for operational workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dstack 0.20.x `files` mount no longer creates /workflow — use default
repo clone instead. Remove `files: - ..:/workflow` and `cd /workflow &&`
from all configs. Broaden GPU filter to memory: 20GB.. for capacity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean canonical spec names with best configs from iterative experiments:
- LN actor + extended frames for hard envs (HalfCheetah, Walker2d, Ant)
- iter=2 + [1024] critics for very hard envs (Humanoid, HumanoidStandup, InvDblPend)
- [512] critic upgrades for medium envs (Hopper, InvPend)
- Remove all suffixed experiment specs (_ln, _v2, _i2, _7m, _8m, _wn)
- Update BENCHMARKS.md spec names to match clean canonical names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pendulum: -163.52 → -145.66 (improved, beats SAC -150.97)
- LunarLander: 136.25 → 139.21 (marginal improvement)
- Regenerate plots for all classic/box2d envs with latest data
- Update CROSSQ_TRACKER with Wave 9 clean spec relaunch status
- Fix dstack max_duration to 8h for longer MuJoCo runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hopper +8% (1403), Walker +3% (4390), InvPend +4% (878).
Clean spec reruns with upgraded configs (LN actors, wider critics).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InvDblPend [1024] critics too wide for 9-dim state — old v2 [512] scored 8255.
Humanoid needs more training time to reduce variance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CrossQ should demonstrate sample efficiency at or below SAC frame budgets.
InvDblPend: [1024]→[512] critics (too wide for 9-dim state).
All envs capped: HalfCheetah 4M, Hopper/Walker 3M, Ant/InvDblPend/
InvPend/Swimmer 2M, Humanoid/HumanoidStandup/Reacher/Pusher 1M.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… Swimmer 5M

Ant needs 3M to plateau at ~5000. Humanoid solves >1000 at 2M.
Swimmer is slow learner, needs 5M to reliably hit >200.
All runs well under 4h wall time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed duplicate CrossQ/SAC lines from stale data folders.
Hopper, Walker, InvPend, HumanoidStandup, Reacher now show
only benchmark-linked runs (PPO, SAC, CrossQ).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ots from reruns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CartPole: 10% of 200K budget lost to BRN warmup. 300K gives 50% more
training (still trivial wall time). 4 attempts averaged 360, target 400.

Humanoid: iter=2 at 2M = 125K gradients (8x less than SAC iter=16).
iter=4 gives 500K gradients at ~200fps (~2.8h). Humanoid needs high UTD
per prior experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
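
For context on the BRN warmup cost mentioned above, here is a minimal sketch of Batch Renormalization's clamped correction factors with a warmup schedule, following the formulas in Ioffe (2017). The final limits `r_max=3` and `d_max=5` are the values used in that paper, and the single linear ramp is an assumption; neither is confirmed as this PR's setting. `warmup_limits` and `batch_renorm` are hypothetical names, and the code handles one feature column in plain Python.

```python
import math
from typing import List, Tuple


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def warmup_limits(step: int, warmup_steps: int,
                  r_max_final: float = 3.0,
                  d_max_final: float = 5.0) -> Tuple[float, float]:
    """Relax (r_max, d_max) linearly from (1, 0), i.e. plain BatchNorm, to the final limits."""
    frac = clamp(step / warmup_steps, 0.0, 1.0)
    return 1.0 + frac * (r_max_final - 1.0), frac * d_max_final


def batch_renorm(x: List[float], run_mean: float, run_std: float,
                 r_max: float, d_max: float, eps: float = 1e-5) -> List[float]:
    """Normalize one feature column using clamped corrections r and d."""
    mu_b = sum(x) / len(x)
    std_b = math.sqrt(sum((v - mu_b) ** 2 for v in x) / len(x) + eps)
    r = clamp(std_b / run_std, 1.0 / r_max, r_max)
    d = clamp((mu_b - run_mean) / run_std, -d_max, d_max)
    # In an autograd framework r and d would be treated as constants (no gradient).
    return [(v - mu_b) / std_b * r + d for v in x]


# Halfway through a 5000-step warmup (illustrative values).
r_max, d_max = warmup_limits(step=2500, warmup_steps=5000)
out = batch_renorm([1.0, 2.0, 3.0], run_mean=0.0, run_std=1.0, r_max=r_max, d_max=d_max)
```

While the limits sit at `(1, 0)` the layer behaves like ordinary batch normalization, so the running statistics only start to influence training outputs as the warmup relaxes them.
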
…dients

Respect env settings max_frame=2e5. Instead of extending frames,
double gradient steps (50K→100K) via iter=2 within same budget.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverted all experimental changes (training_iter=4, warmup=2000,
log_alpha_max=1.0, training_start_step=5000) back to original arc
settings: iter=1, warmup=5000, log_alpha_max=2.0, start_step=1000.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Last attempt — double gradients within same 200K frame budget.
Reverted warmup_steps=5000 from arc spec, only change is iter=2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…v12, iter=2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hopper env setting is max_frame 4e6. Previous CrossQ entry used a
6M frame run. Replaced with 3M run (MA 1168 vs 1403) that respects
the env settings limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MsPacman v12 and Pong v14 were experimental runs not on public HF.
Replaced with standard-named runs that exist on SLM-Lab/benchmark.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 of 11 CrossQ MuJoCo entries had stale max_frame values from
earlier experimental runs. Updated to match actual t0_spec.json
frames. Flagged InvertedPendulum (7M exceeds 4M env settings).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cibility

Update 6 max_frame values in crossq_mujoco.yaml to match the runs that
produced benchmark scores. Remove -s override instructions from docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nibatch

From feat/optimization worktree:
- Pinned memory transfers for faster GPU batch prep (ml_util.py)
- torch.profiler integration with --profile flag (torch_profiler.py, control.py)
- PPO MuJoCo minibatch_size 64→256 (verified +17-39% scores)
- PPO Atari max_frame as ${variable} for flexible runs
- BatchRenorm register_buffer fix for proper state_dict serialization
- Remove dead CrossQ calc_v_next method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kengz and others added 8 commits March 2, 2026 20:35
…in specs

YAML doesn't recognize bare scientific notation like "4e6" — only "4.0e+6".
Convert all numeric -s values to canonical integer/float form before
substitution. Also fail fast with clear error when ${var} placeholders
remain unsubstituted, preventing silent runtime bugs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
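
The two behaviors described in this commit can be sketched as follows: numeric override values are rewritten into a form that YAML 1.1 loaders such as PyYAML read as numbers, and any `${var}` placeholder left after substitution raises an error. This is a standard-library illustration under assumed names (`canonical_number`, `substitute`), not the function added in the PR.

```python
import math
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def canonical_number(value: str) -> str:
    """Return an int/float literal that YAML reads as a number, else the input unchanged."""
    try:
        number = float(value)
    except ValueError:
        return value  # not numeric: leave strings such as env names alone
    if not math.isfinite(number):
        return value
    if number.is_integer():
        return str(int(number))  # "4e6" -> "4000000"
    text = repr(number)
    if "e" in text and "." not in text:
        # "1e-05" -> "1.0e-05": keep a dot in the mantissa and a signed exponent.
        mantissa, exponent = text.split("e")
        text = f"{mantissa}.0e{exponent}"
    return text


def substitute(spec_text: str, overrides: dict) -> str:
    """Replace ${var} placeholders, then fail fast if any remain."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in overrides:
            return canonical_number(str(overrides[name]))
        return match.group(0)  # leave unknown placeholders for the check below

    result = _PLACEHOLDER.sub(repl, spec_text)
    leftover = sorted(set(_PLACEHOLDER.findall(result)))
    if leftover:
        raise ValueError(f"Unsubstituted spec variables: {leftover}")
    return result


print(substitute("max_frame: ${max_frame}", {"max_frame": "4e6"}))
```

Raising on leftover placeholders turns a silent misconfiguration (a literal `${max_frame}` string reaching the trainer) into an immediate, readable error.
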
Revert spec changes from optimization worktree that aren't core
optimizations. Minibatch 256 caused -50% PPO HalfCheetah regression.
Atari ${max_frame} variable added unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7/8 verification runs scored lower than baselines. pin_memory() made
non_blocking=True genuinely async (was silently sync before). Reverting
to isolate whether async transfers cause the regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Supervisor-first trait to Role & Mindset
- Add explanatory sentences to each code design principle
- Rewrite Agent Teams section: TeamCreate over subagents, panel of agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CrossQ CartPole: 324.10→334.59 (was using strength, not total_reward_ma)
- CrossQ Humanoid: 1102.00→1755.29 (same strength vs total_reward_ma issue)
- CrossQ Swimmer: fix HF link to correct run folder (_184204 not _134711)
- CrossQ Acrobot: mark as warning (score -103.13 does not meet target >-100)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kengz merged commit b9139ef into master on Mar 4, 2026
3 checks passed
@kengz deleted the feat/improvements-roadmap branch on March 4, 2026 10:41

github-actions bot commented Mar 4, 2026

🎉 This PR is included in version 5.2.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
