Parallel self-play using Tokio #370
Merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run many self-play games concurrently in one Rust process by routing all NN evaluations through a single eval coordinator thread that batches requests and serves a shared DashMap eval cache keyed on CompactState.
- New `self_play.rust` config block + matching `--num-threads`, `--games-per-thread`, `--eval-batch-size`, `--eval-max-wait-ms`, `--eval-cache-max-size` CLI flags. Defaults reproduce sequential behavior (1 worker, batch-of-1, no cache).
- `eval_coordinator.rs`: owns the ORT session; collects up to eval_batch_size requests bounded by eval_max_wait_ms from the first request, runs one batched inference, inserts into the cache up to the configured cap (no eviction), and routes results back via oneshot channels. Handles ReloadModel (drop session, clear cache, load new) and Shutdown control messages.
- BatchingEvaluator (new Evaluator impl): looks up the shared cache, builds the rotated input tensor in the requesting worker thread on miss, sends an EvalRequest, blocks on the response.
- AlphaZeroAgent now holds Box<dyn Evaluator + Send> via a new with_evaluator constructor so workers can plug in BatchingEvaluator while the legacy path keeps using OnnxEvaluator.
- mcts::search takes &mut dyn Evaluator so the boxed agent works.
- --use-raw-onnx-agent keeps the legacy single-threaded path unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
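The batching rule in the commit above (up to eval_batch_size requests, bounded by eval_max_wait_ms counted from the first request) can be sketched with a standard-library channel. This is only an illustration: `collect_batch` and its parameters are hypothetical names, and the PR's actual coordinator may be structured differently.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical helper: gather one batch from `rx`.
/// Blocks for the first item, then accepts more until the batch is full
/// or `max_wait` has elapsed since that first item arrived.
/// Returns an empty Vec once all senders are dropped and the queue is drained.
fn collect_batch<T>(rx: &Receiver<T>, batch_size: usize, max_wait: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(batch_size);
    // Wait without a deadline for the first request.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // channel closed, nothing to evaluate
    }
    // The wait budget starts when the first request is received.
    let deadline = Instant::now() + max_wait;
    while batch.len() < batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break; // budget spent; with max_wait == 0 this yields a batch of one
        }
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

In this sketch a zero wait budget always produces a batch of one, even if more requests are already queued.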
parallel_games (Python) and games_per_thread (Rust) named the same concept: concurrent games per self-play process. Unify on the games_per_thread name and hoist the Rust-only eval-coordinator knobs (num_threads, eval_batch_size, eval_max_wait_ms, eval_cache_max_size) up to the top of self_play so both sides parse one flat block.
Python (train_v2):
- v2/config.py: SelfPlayConfig renamed parallel_games -> games_per_thread; added num_threads/eval_batch_size/eval_max_wait_ms/eval_cache_max_size with defaults that preserve current Python behavior.
- v2/self_play.py, tune_selfplay.py, test/config_test.py: renamed.
Rust:
- selfplay_config.rs: dropped the self_play.rust sub-section; moved its fields up to SelfPlayWorkerConfig alongside the renamed games_per_thread.
- bin/selfplay.rs: ResolvedRustConfig now reads SelfPlayWorkerConfig directly.
YAMLs: parallel_games -> games_per_thread in experiments/*.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ds_per_process Makes the YAML field names match what they actually control: num_processes is the count of self-play subprocesses, threads_per_process is the per-process worker-thread count inside Rust self-play. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Async tokio worker model, leaf-parallel MCTS with virtual loss, pipelined eval coordinator, tree reuse across moves. Targets ≥4x games/sec on the reference hardware (RTX 5080 + 12c/24t) by saturating both CPU and GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
24 tasks covering config schema changes, MCTS extensions (virtual loss, tree reuse, in-arena Dirichlet noise), async pipelined eval coordinator, LeafParallelMCTS, async game runner, selfplay binary rewiring, profile counters, benchmark script, and manual perf verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops threads_per_process/games_per_thread; adds games_per_process, leaf_parallelism, virtual_loss, enable_tree_reuse, mcts_worker_threads. Raises eval_cache_max_size default from 0 to 100000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror Python schema changes: drops threads_per_process / games_per_thread, adds games_per_process, leaf_parallelism, virtual_loss, enable_tree_reuse, mcts_worker_threads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lays groundwork for tree reuse: prior_clean stays untouched, and noise can be re-mixed into the new root's children after subtree promotion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three dedicated OS threads connected by capacity-1 sync_channels, fronted by a tokio mpsc. finalize_policy is parallelized via rayon in the post-process stage. Old eval_coordinator.rs remains until cleanup task. Also fix pre-existing selfplay.rs compile error: SelfPlayWorkerConfig no longer has threads_per_process/games_per_thread after the Task 3 rework; map games_per_thread YAML fallback to games_per_process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-wait
- Add one-time cache-saturation warning when cache.len() >= cache_max
- Add debug_assert to verify batch_len > 0 before division
- Document zero-wait (max_wait=0) semantics forcing batch-size-1
- Annotate PostIn::Reload handling with context comment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owns its arena across calls so subsequent task can enable tree reuse. Each outer iteration selects up to K leaves with virtual loss, classifies them as Terminal/Hit/Miss, sends miss requests to the eval pipeline, then awaits in selection order while undoing VL + backpropping real results.
Stub coordinator emulates the eval pipeline (uniform priors, value=0). With K=1, vl=0, no noise, no tree reuse, the leaf-parallel search behaves like sequential MCTS and we verify the configured iteration count is run. Manual tokio runtime via Builder rather than #[tokio::test] because the workspace sets panic = "abort" in [profile.dev], which interacts badly with the tokio::test macro. The mcts instance holds an internal Sender clone, so it must be dropped before tx so the stub's rx.recv() returns None and the stub task exits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
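The drop-ordering point above is about channel closure: a receiver only sees end-of-stream after every sender, clones included, is gone. Below is a minimal standard-library analogue (a thread and `std::sync::mpsc` rather than a tokio task and tokio's mpsc); `Searcher` and `run_stub_until_closed` are hypothetical names used only for illustration.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical stand-in for a search object that keeps its own Sender clone.
struct Searcher {
    requests: Sender<u32>,
}

/// Runs a stub consumer and returns how many requests it saw before the
/// channel closed.
fn run_stub_until_closed() -> usize {
    let (tx, rx) = channel::<u32>();
    let stub = thread::spawn(move || {
        let mut seen = 0usize;
        // recv() keeps succeeding (or blocking) while any Sender is alive.
        while rx.recv().is_ok() {
            seen += 1;
        }
        seen
    });

    let searcher = Searcher { requests: tx.clone() };
    searcher.requests.send(1).unwrap();
    searcher.requests.send(2).unwrap();
    tx.send(3).unwrap();

    drop(searcher); // releases the internal clone
    drop(tx); // last sender gone: the stub's loop ends
    stub.join().unwrap()
}
```

If `searcher` were still alive at the `join`, the stub would block forever on `recv()`. The test in the commit avoids that kind of hang by dropping the mcts instance.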
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both batch and continuous modes now use the new pipelined eval coordinator, spawn one async game task per concurrent game, and rely on LeafParallelMCTS for per-game search. Old BatchingEvaluator/coordinator scaffolding is no longer referenced; the legacy --use-raw-onnx-agent path is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new pipelined coordinator in eval_pipeline.rs replaces it entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove references to the now-deleted BatchingEvaluator in documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- selfplay.rs: Updated doc comment to reflect tokio + leaf-parallel MCTS architecture
- self_play.py: Renamed games_per_thread to games_per_process
- tune_selfplay.py: Renamed all occurrences of games_per_thread to games_per_process across function params, dict keys, CLI help text, and override strings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No behavioral change; rustfmt formatting only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Bump Cargo.toml edition from 2021 to 2024.
- Escape the now-reserved `gen` keyword: `rng.gen()` → `rng.r#gen()`.
- Fix three rand callsites in minimax.rs, selfplay_game.rs, agent.rs.
- Tighten two `apply_dirichlet_noise` / `expand_node` filter patterns (`|(_, &p)|` → `|&(_, &p)|`); edition 2024 forbids explicit deref inside an implicitly-borrowing pattern.
- Allow `unsafe_op_in_unsafe_fn` crate-wide in lib.rs to suppress the edition-2024 warnings emitted by pyo3 0.22's macro-generated code (121 warnings before, 0 after).
cargo check, cargo build --release, and cargo test --features binary all run clean (153 tests pass).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump opset_version to 18 (matches what the runtime auto-converts to) and pass dynamo=False to keep the legacy exporter that supports dynamic_axes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It fails sometimes due to randomness.
Self-play is unusably slow for 9x9/10-wall. Profiling showed the resnet runs entirely on CPU (GPU at 7-9%, no selfplay compute process) because load_session registers no execution provider and ort has no cuda feature. Spec captures the root cause, the CUDA-EP-with-CPU- fallback fix, and a Blackwell sm_120 de-risking spike to run first.
Step-by-step plan to register the ONNX Runtime CUDA execution provider in load_session behind a gpu Cargo feature, with a CPU fallback, an sm_120 go/no-go benchmark gate (with a load-dynamic contingency), a CUDA-vs-CPU numerics test, and a results writeup.
Self-play resnet inference ran entirely on CPU because load_session registered no execution provider. Add a gpu Cargo feature (ort/cuda) and, behind it, register the CUDA execution provider with a CPU fallback plus Level3 graph optimization. CPU-only builds are unaffected.
The bundled ORT CUDA provider misses libcudnn.so.9 at runtime (not in system lib path). Switching to ort/load-dynamic lets us point at the onnxruntime-gpu 1.26.0 wheel installed in the venv, which properly activates CUDAExecutionProvider on the RTX 5080 (sm_120/Blackwell). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures the full before/after benchmark: CPU baseline ~7,600 evals/sec vs GPU steady-state ~38,000 evals/sec (~5× speedup) on the RTX 5080. Documents Path B (load-dynamic + onnxruntime-gpu 1.26.0) required because bundled ORT CUDA 12 libs are mismatched against the host CUDA 13 runtime. Notes required env vars for GPU inference and flags train_v2.py subprocess env inheritance as a follow-up. Identifies CPU-side MCTS pathfinding as the next bottleneck (GPU at 20-31% util, batch_wait_ms=0). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A gpu-feature selfplay binary uses ort/load-dynamic and panics or silently falls back to CPU when ORT_DYLIB_PATH is absent. Bail with a clear message at session-load time so the operator knows exactly what to set instead of hunting through ORT internals.
train_v2.py spawned selfplay without an explicit env, so the GPU path only worked if ORT_DYLIB_PATH/LD_LIBRARY_PATH were exported by hand. Discover them from the installed onnxruntime-gpu and nvidia wheels and inject them into the subprocess, and pin onnxruntime-gpu so a fresh environment can reproduce the GPU build.
CI resolved ort 2.0.0-rc.12 (Cargo.lock is gitignored, and the requirement permitted any rc) while local builds used rc.11, so the build broke only in CI. rc.12 made two breaking changes to the SessionBuilder chain that load_session now hits:
- with_optimization_level / with_execution_providers return Result<Self, Error<SessionBuilder>>; that error carries the builder back and is not Send/Sync, so anyhow's .context() no longer applies. Convert through the error's Display instead.
- commit_from_file now takes &mut self, so the final builder must be a mutable binding.
Pin ort to =2.0.0-rc.12 so CI and local checkouts resolve the same pre-release and stop diverging on its frequent breaking changes.
alejandromarcu
approved these changes
May 26, 2026
Owner
Author
Hmmm... not sure why that test is hanging in CI. Debugging now.
The real-model selfplay parity test hangs intermittently (only under CI/contention; not reproducible in isolation across 130+ local runs). run_python used Command::output() with no timeout, so any stall blocked the test indefinitely (10+ min locally; killed at 60s on CI). Pin OMP/MKL/OPENBLAS threads to 1 for the python child (avoids torch CPU thread-pool oversubscription across the concurrent parity tests, the suspected trigger; verified parity-safe for these small models), and add a 45s per-subprocess timeout that on expiry sends SIGABRT with PYTHONFAULTHANDLER=1 to dump the child's all-thread traceback, then fails with it -- turning an indefinite hang into a fast, self-diagnosing failure that captures the stack from wherever it actually reproduces.
The CI log (PR #370 run 26534399531) shows test_real_model_selfplay_trace_and_npz_matches_python ran for 6h until GitHub's default job timeout cancelled it. My prior 45s timeout in run_python did not fire, which proves the hang isn't in the python subprocess at all -- it's in the rust side of that test (generate_rust_real_model_trace_and_write_npz, which loads an ONNX model via AlphaZeroAgent::new and runs MCTS with ORT CPU inference). Other python_consistency tests that only use run_python all passed. None of the three ORT Session::builder() call sites limited threads. ORT defaults to all CPU cores for intra-op parallelism; parallel CPU tests each loading a session oversubscribe its threadpool and intermittently deadlock -- the same OpenMP-pool deadlock pattern we already hit with torch, just for ORT. Pin intra-op threads to 1 everywhere: parity-safe (the real-model parity test still passes with exact hex match), safe for production self-play (GPU does the math; the eval pipeline gets throughput from leaf-parallel batching, not intra-op threading).
Reproduced the CI hang locally: with --all-features (which enables
gpu = ort/cuda + ort/load-dynamic) and no ORT_DYLIB_PATH set,
AlphaZeroAgent::new -> OnnxEvaluator::new -> Session::builder()
deadlocks on a futex inside ort's dynamic-loader path, before reaching
with_intra_threads / commit_from_file. Earlier "evidence" for a
threadpool deadlock was wrong; the real cause is gpu/load-dynamic
without a dylib path. Per-thread wchan of the hung test process:
tid 3214454 state=S name=quoridor_rs-a83 wchan=futex_do_wait
tid 3214455 state=S name=python_consiste wchan=futex_do_wait
Two fixes:
- Change the CI test step from `cargo test --all-features` to
`cargo test --features binary`. CI has no GPU; enabling load-dynamic
+ cuda is pointless and is the source of the deadlock. With
--features binary the test binary uses statically-linked CPU ORT and
the previously-hanging test passes in ~7s.
- Add the same ORT_DYLIB_PATH guard that eval_pipeline::load_session
already has to OnnxAgent::new and OnnxEvaluator::new, so any future
gpu build without ORT_DYLIB_PATH fails immediately with a clear error
("gpu feature is enabled but ORT_DYLIB_PATH is not set...") instead
of silently deadlocking.
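The fail-fast guard described in the second fix can be sketched as a small pure function. This is a hedged illustration, not the PR's code: `check_ort_dylib_path` is a hypothetical name, the error text reuses only the quoted prefix from the commit message, and treating an empty value as unset is an assumption.

```rust
use std::ffi::OsStr;

/// Hypothetical guard: refuse to build an ORT session when the gpu feature
/// is on but no dynamic library path was provided.
/// `gpu_enabled` would come from something like `cfg!(feature = "gpu")`, and
/// `ort_dylib_path` from `std::env::var_os("ORT_DYLIB_PATH").as_deref()`.
fn check_ort_dylib_path(gpu_enabled: bool, ort_dylib_path: Option<&OsStr>) -> Result<(), String> {
    if !gpu_enabled {
        // CPU-only builds link ORT statically and need no path.
        return Ok(());
    }
    match ort_dylib_path {
        // Assumption: an empty value is treated the same as an unset one.
        Some(path) if !path.is_empty() => Ok(()),
        _ => Err("gpu feature is enabled but ORT_DYLIB_PATH is not set".to_string()),
    }
}
```

Passing the flag and the environment value as arguments keeps the check testable without mutating the process environment. A caller would run it before touching `Session::builder()`, so a missing path becomes an immediate error rather than a hang.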
Summary of changes
Instructions for trying this out
pip install -r deep_quoridor/requirements.txt
cargo build --release --all-features --bin selfplay
python -O deep_quoridor/src/train_v2.py <path to downloaded config.yaml> --overrides run_id=<unique name for the run>
Multi-threaded Rust self-play with Tokio
A fixed number of worker threads is now created, and any of them can work on the MCTS tree of whichever game is ready. This lets you choose the number of threads based on the number of CPU cores, and separately choose the number of parallel games to take advantage of the efficiency of batched evaluations. Interestingly, Claude implemented this in a way that immediately picks up a new model version, even mid-game.
Virtual loss
We want to be able to run MCTS searches on a tree without waiting for the evaluator each time. The problem is that every search would choose the same path through the tree, since the total values and visit counts will not have changed. Virtual loss fixes this by having each search temporarily backpropagate a fixed number of losses along its path (defaults to 3, which AlphaZero used). This makes that path less attractive to the next search. Once the evaluations are ready, the virtual losses are removed and the actual evaluations are backpropagated.
If the number of searches done before evaluations are ready (call this K) is too large, then the overall statistics of MCTS will be thrown off. Anecdotally, using K=16 for mcts_n=400 seems to cause no noticeable decrease in performance.
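The bookkeeping described above can be sketched as follows. This is a simplified illustration rather than the PR's implementation: `NodeStats`, `apply_virtual_loss`, and `revert_and_backprop` are hypothetical names, values are assumed to lie in [-1, 1] with a loss counted as -1, and the per-ply sign flip between the two players is deliberately omitted.

```rust
/// Hypothetical per-node statistics stored in an arena (a flat slice here).
#[derive(Debug, Default, Clone, PartialEq)]
struct NodeStats {
    visits: u32,
    value_sum: f32,
}

impl NodeStats {
    /// Mean value used by selection; 0.0 for unvisited nodes.
    fn q(&self) -> f32 {
        if self.visits == 0 {
            0.0
        } else {
            self.value_sum / self.visits as f32
        }
    }
}

/// Pretend the path was visited `vl` extra times and lost each time, so
/// the next selection pass is steered away from it.
fn apply_virtual_loss(arena: &mut [NodeStats], path: &[usize], vl: u32) {
    for &i in path {
        arena[i].visits += vl;
        arena[i].value_sum -= vl as f32;
    }
}

/// Once the real evaluation arrives: remove the `vl` fake losses and record
/// one genuine visit with the evaluated `value`.
fn revert_and_backprop(arena: &mut [NodeStats], path: &[usize], vl: u32, value: f32) {
    for &i in path {
        arena[i].visits = arena[i].visits - vl + 1;
        arena[i].value_sum += vl as f32 + value;
    }
}
```

With `vl = 3`, a node at 4 visits and a value sum of 2.0 drops from a mean of 0.5 to below zero while its evaluation is pending. It returns to 5 visits and a sum of 2.5 once a real value of 0.5 is backed up. The more leaves are in flight at once (K), the more of these temporary distortions overlap, which is why a large K skews the statistics.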
If we want to, we could try techniques that are newer or better than virtual loss, such as "virtual mean".
Self-play config variables renamed/added