Parallel self-play using Tokio by jonbinney · Pull Request #370 · jonbinney/deep_rabbit_hole

jonbinney · 2026-05-23T17:04:27Z

Summary of changes

Multi-threaded self-play in Rust, using the Tokio asynchronous runtime.
An evaluation cache in Rust self-play, shared across all worker threads and all games.
"Virtual loss", so that MCTS search can be run multiple times on a single tree without waiting for each evaluation.
After choosing a move, the Rust AlphaZero agent now reuses that move's MCTS subtree to save work on the next move.
Renamed the self-play config variables that control parallelism so that they make sense for both the Rust and the Python multithread/multiprocess designs.
Updated the Rust package to use the 2024 edition of Rust instead of the 2021 edition.
Included the design and plan docs that the Claude superpowers plugin created. @adamantivm will like this :-)

Instructions for trying this out

Download this config file
Create a python virtual environment, activate it, and run pip install -r deep_quoridor/requirements.txt
In the deep_quoridor/rust directory, run cargo build --release --all-features --bin selfplay
From the root directory of the repo, run python -O deep_quoridor/src/train_v2.py <path to downloaded config.yaml> --overrides run_id=<unique name for the run>

Multi-threaded rust self-play with Tokio

A fixed number of worker threads is now created, and any of them can work on the MCTS tree of whichever game is ready. This lets you choose the number of threads based on the number of CPU cores, and choose the number of parallel games to take advantage of the efficiency of batched evaluations. Interestingly, Claude implemented this in a way that immediately picks up a new model version, even mid-game.

Virtual loss

We want to be able to run several MCTS searches on a tree without waiting for the evaluator each time. The problem is that every search would choose the same path through the tree, since the total value and visit counts would not have changed. Virtual loss fixes this by having each search temporarily backpropagate a fixed number of losses along its path (the default is 3, which AlphaZero used). This makes that path through the tree less attractive to the next search. Once the evaluations are ready, the virtual losses are removed and the actual evaluations are backpropagated.
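The apply/undo cycle can be illustrated with a minimal Python sketch. It is not the PR's Rust code: the `Node` class, the PUCT constant `c_puct=1.5`, and the one-level tree are assumptions made for illustration, and values are taken from the perspective of the player choosing at the root.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        # Mean value; unvisited nodes are treated as 0.
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node: Node, c_puct: float = 1.5):
    """Return the (move, child) pair with the highest PUCT score."""
    total = sum(c.visits for c in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u

    return max(node.children.items(), key=score)


def apply_virtual_loss(path, vl: int = 3):
    # Pretend the path was visited vl more times and lost every time.
    for n in path:
        n.visits += vl
        n.value_sum -= vl


def undo_virtual_loss(path, vl: int = 3):
    # Exact inverse of apply_virtual_loss.
    for n in path:
        n.visits -= vl
        n.value_sum += vl


def backprop(path, value: float):
    # Record the real evaluation once it arrives.
    for n in path:
        n.visits += 1
        n.value_sum += value


root = Node(prior=1.0, children={"a": Node(prior=0.6), "b": Node(prior=0.4)})
first_move, first = select_child(root)   # no statistics yet: highest prior wins
apply_virtual_loss([first])              # evaluation for "a" is still pending
second_move, _ = select_child(root)      # the pending path now looks worse
undo_virtual_loss([first])               # evaluation arrived: remove the fake losses
backprop([first], 0.2)                   # and record the real value
print(first_move, second_move)           # prints: a b
```

Without the `apply_virtual_loss` call, the second selection would return `a` again, which is exactly the duplication that virtual loss is meant to avoid.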

If the number of searches done before evaluations are ready (call this K) is too large, then the overall statistics of MCTS will be thrown off. Anecdotally, using K=16 for mcts_n=400 seems to cause no noticeable decrease in performance.

If we want to, we could try techniques that are newer/better than virtual loss, like "virtual mean".

Self-play config variables renamed/added

num_workers is now num_processes, to be more explicit. I have always been using 1 Rust self-play process, to take advantage of the shared evaluator cache.
parallel_games is now games_per_process, to make it clearer. In Python, each self-play process has exactly one thread that handles all of those games. In Rust there can be more than one thread.
mcts_worker_threads is only used in Rust at the moment. This should be set as high as the CPU can handle. Each of these threads can work on any of the games.
leaf_parallelism is the number of searches that can be done on a single MCTS tree before waiting for evaluations from the GPU (virtual loss values are used until the evaluations are ready).
virtual_loss is the number of losses temporarily backpropagated for each node waiting on an evaluation.
enable_tree_reuse: if true, the MCTS subtree of the chosen move is reused for the next move.
eval_cache_max_size limits the evaluator cache size. One million is a pretty safe value; set it as high as the machine's memory allows.
eval_batch_size: the evaluator batches up evaluation requests so that it can send them all to the GPU together. This is the maximum number of evaluations it can send to the GPU at once. Setting it larger increases evaluation throughput, but also increases latency. If the latency is too high, MCTS search threads running on the CPU will be blocked while waiting for the evaluation.
eval_max_wait_ms: we don't want to wait forever for a full-size batch. After this many milliseconds we just run whatever batch is ready to go.
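To see how these knobs relate to each other, here is a hedged Python sketch. `SelfPlaySettings` is a hypothetical class, not the repo's `SelfPlayConfig`, and the values marked "illustrative" are assumptions rather than project defaults. The `max_pending_evals` helper encodes one assumption worth checking against the real code: each game can have at most `leaf_parallelism` evaluations in flight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfPlaySettings:  # hypothetical name, for illustration only
    num_processes: int = 1            # one Rust process shares one evaluator cache
    games_per_process: int = 64       # illustrative value
    mcts_worker_threads: int = 12     # illustrative value; size to the CPU
    leaf_parallelism: int = 16        # K from the virtual loss section
    virtual_loss: int = 3
    enable_tree_reuse: bool = True
    eval_cache_max_size: int = 1_000_000  # the "pretty safe" value from the description
    eval_batch_size: int = 256        # illustrative value
    eval_max_wait_ms: int = 2         # illustrative value

    def validate(self, mcts_n: int) -> None:
        """Reject settings that cannot describe a working run."""
        for name in ("num_processes", "games_per_process",
                     "mcts_worker_threads", "leaf_parallelism", "eval_batch_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        for name in ("virtual_loss", "eval_cache_max_size", "eval_max_wait_ms"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        if self.leaf_parallelism > mcts_n:
            raise ValueError("leaf_parallelism should not exceed mcts_n")

    def max_pending_evals(self) -> int:
        """Upper bound on evaluation requests in flight within one process."""
        return self.games_per_process * self.leaf_parallelism


settings = SelfPlaySettings()
settings.validate(mcts_n=400)
print(settings.max_pending_evals())  # prints: 1024
```

If `max_pending_evals()` is much smaller than `eval_batch_size`, batches can only fill up by timing out on `eval_max_wait_ms`, so comparing the two numbers is a quick sanity check before a long run.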

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run many self-play games concurrently in one Rust process by routing all NN evaluations through a single eval coordinator thread that batches requests and serves a shared DashMap eval cache keyed on CompactState.

- New `self_play.rust` config block + matching `--num-threads`, `--games-per-thread`, `--eval-batch-size`, `--eval-max-wait-ms`, `--eval-cache-max-size` CLI flags. Defaults reproduce sequential behavior (1 worker, batch-of-1, no cache).
- `eval_coordinator.rs`: owns the ORT session; collects up to eval_batch_size requests bounded by eval_max_wait_ms from the first request, runs one batched inference, inserts into the cache up to the configured cap (no eviction), and routes results back via oneshot channels. Handles ReloadModel (drop session, clear cache, load new) and Shutdown control messages.
- BatchingEvaluator (new Evaluator impl): looks up the shared cache, builds the rotated input tensor in the requesting worker thread on miss, sends an EvalRequest, blocks on the response.
- AlphaZeroAgent now holds Box<dyn Evaluator + Send> via a new with_evaluator constructor so workers can plug in BatchingEvaluator while the legacy path keeps using OnnxEvaluator.
- mcts::search takes &mut dyn Evaluator so the boxed agent works.
- --use-raw-onnx-agent keeps the legacy single-threaded path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
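The batching rule described in this commit (take up to eval_batch_size requests, with the wait bounded by eval_max_wait_ms measured from the first request) and the capped, eviction-free cache can be sketched in Python as follows. This is an analogy built on the standard-library `queue` module, not a port of `eval_coordinator.rs`; the function names are hypothetical.

```python
import queue
import time


def collect_batch(requests: queue.Queue, batch_size: int, max_wait_ms: float) -> list:
    """Block for the first request, then gather more until the batch is full
    or max_wait_ms has elapsed since that first request arrived."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough: run whatever we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def cache_insert(cache: dict, key, value, max_size: int) -> bool:
    """Insert unless the cache is at its cap. Nothing is ever evicted."""
    if key not in cache and len(cache) >= max_size:
        return False
    cache[key] = value
    return True


pending = queue.Queue()
for i in range(5):
    pending.put(i)

full = collect_batch(pending, batch_size=3, max_wait_ms=50)     # fills immediately
partial = collect_batch(pending, batch_size=8, max_wait_ms=10)  # times out with 2 items
print(full, partial)  # prints: [0, 1, 2] [3, 4]
```

With `max_wait_ms=0` the deadline has already passed when the first request is taken, so this sketch always returns a batch of one; that mirrors the zero-wait behavior a later commit in this PR documents.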

parallel_games (Python) and games_per_thread (Rust) named the same concept: concurrent games per self-play process. Unify on the games_per_thread name and hoist the Rust-only eval-coordinator knobs (num_threads, eval_batch_size, eval_max_wait_ms, eval_cache_max_size) up to the top of self_play so both sides parse one flat block.

Python (train_v2):
- v2/config.py: SelfPlayConfig renamed parallel_games -> games_per_thread; added num_threads/eval_batch_size/eval_max_wait_ms/eval_cache_max_size with defaults that preserve current Python behavior.
- v2/self_play.py, tune_selfplay.py, test/config_test.py: renamed.

Rust:
- selfplay_config.rs: dropped the self_play.rust sub-section; moved its fields up to SelfPlayWorkerConfig alongside the renamed games_per_thread.
- bin/selfplay.rs: ResolvedRustConfig now reads SelfPlayWorkerConfig directly.

YAMLs: parallel_games -> games_per_thread in experiments/*.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ds_per_process Makes the YAML field names match what they actually control: num_processes is the count of self-play subprocesses, threads_per_process is the per-process worker-thread count inside Rust self-play. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Async tokio worker model, leaf-parallel MCTS with virtual loss, pipelined eval coordinator, tree reuse across moves. Targets ≥4x games/sec on the reference hardware (RTX 5080 + 12c/24t) by saturating both CPU and GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

24 tasks covering config schema changes, MCTS extensions (virtual loss, tree reuse, in-arena Dirichlet noise), async pipelined eval coordinator, LeafParallelMCTS, async game runner, selfplay binary rewiring, profile counters, benchmark script, and manual perf verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops threads_per_process/games_per_thread; adds games_per_process, leaf_parallelism, virtual_loss, enable_tree_reuse, mcts_worker_threads. Raises eval_cache_max_size default from 0 to 100000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror Python schema changes: drops threads_per_process / games_per_thread, adds games_per_process, leaf_parallelism, virtual_loss, enable_tree_reuse, mcts_worker_threads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lays groundwork for tree reuse: prior_clean stays untouched, and noise can be re-mixed into the new root's children after subtree promotion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
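A small Python sketch of why keeping a clean prior matters for tree reuse. The `Node` layout, `alpha=0.3`, and `epsilon=0.25` are illustrative assumptions, not values taken from this repo, and `promote_subtree` here only returns the chosen child, whereas the real arena-based version presumably does more bookkeeping.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    prior_clean: float                 # network prior, never modified
    prior: float = field(init=False)   # prior used by search, may include noise
    children: dict = field(default_factory=dict)

    def __post_init__(self):
        self.prior = self.prior_clean


def sample_dirichlet(alpha: float, n: int, rng) -> list:
    # A symmetric Dirichlet sample is a set of normalized Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def apply_root_noise(root: Node, rng, alpha: float = 0.3, epsilon: float = 0.25):
    """Mix fresh noise into the root's children, always starting from prior_clean
    so that repeated calls never compound earlier noise."""
    kids = list(root.children.values())
    if not kids:
        return
    for child, eta in zip(kids, sample_dirichlet(alpha, len(kids), rng)):
        child.prior = (1 - epsilon) * child.prior_clean + epsilon * eta


def promote_subtree(root: Node, move):
    """Return the chosen child as the new root (None if it was never expanded)."""
    return root.children.get(move)


rng = random.Random(0)
grandchildren = {"x": Node(0.5), "y": Node(0.5)}
root = Node(1.0, children={"a": Node(0.7, children=grandchildren), "b": Node(0.3)})

apply_root_noise(root, rng)          # noise for the current move
new_root = promote_subtree(root, "a")
apply_root_noise(new_root, rng)      # re-noise the promoted root's children
```

Because the mix always starts from `prior_clean`, the promoted root's children get one layer of noise no matter how often the subtree is promoted or re-noised.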

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three dedicated OS threads connected by capacity-1 sync_channels, fronted by a tokio mpsc. finalize_policy is parallelized via rayon in the post-process stage. Old eval_coordinator.rs remains until cleanup task. Also fix pre-existing selfplay.rs compile error: SelfPlayWorkerConfig no longer has threads_per_process/games_per_thread after the Task 3 rework; map games_per_thread YAML fallback to games_per_process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…-wait

- Add one-time cache-saturation warning when cache.len() >= cache_max
- Add debug_assert to verify batch_len > 0 before division
- Document zero-wait (max_wait=0) semantics forcing batch-size-1
- Annotate PostIn::Reload handling with context comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Owns its arena across calls so subsequent task can enable tree reuse. Each outer iteration selects up to K leaves with virtual loss, classifies them as Terminal/Hit/Miss, sends miss requests to the eval pipeline, then awaits in selection order while undoing VL + backpropping real results.
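The shape of one outer iteration can be sketched with Python's asyncio. This is a simplified analogy, not the Rust implementation: leaves are plain dicts, the cache is a dict, the tree and virtual loss bookkeeping are reduced to a comment, and `stub_evaluate` stands in for the eval pipeline.

```python
import asyncio

TERMINAL, HIT, MISS = "terminal", "hit", "miss"


def classify(leaf: dict, cache: dict) -> str:
    """Terminal leaves need no evaluation; cached ones are hits; the rest are misses."""
    if leaf["terminal_value"] is not None:
        return TERMINAL
    return HIT if leaf["key"] in cache else MISS


async def stub_evaluate(key: str) -> float:
    # Stand-in for the eval pipeline: yields once, then returns value 0.
    await asyncio.sleep(0)
    return 0.0


async def run_iteration(leaves: list, cache: dict, evaluate) -> list:
    """leaves are given in selection order; results come back in the same order."""
    pending = []
    for leaf in leaves:
        kind = classify(leaf, cache)
        # Miss requests are sent right away so they can be batched together.
        task = asyncio.create_task(evaluate(leaf["key"])) if kind == MISS else None
        pending.append((leaf, kind, task))

    results = []
    for leaf, kind, task in pending:
        if kind == TERMINAL:
            value = leaf["terminal_value"]
        elif kind == HIT:
            value = cache[leaf["key"]]
        else:
            value = await task
            cache[leaf["key"]] = value
        # A real search would undo the virtual loss on this leaf's path here
        # and backpropagate `value` instead of just recording it.
        results.append((leaf["key"], kind, value))
    return results


leaves = [
    {"key": "s1", "terminal_value": None},
    {"key": "s2", "terminal_value": 1.0},
    {"key": "s3", "terminal_value": None},
]
cache = {"s3": 0.5}
results = asyncio.run(run_iteration(leaves, cache, stub_evaluate))
print(results)
# prints: [('s1', 'miss', 0.0), ('s2', 'terminal', 1.0), ('s3', 'hit', 0.5)]
```

Sending all misses before awaiting any of them is what lets K leaves from one tree share a batch; awaiting in selection order keeps the backpropagation order deterministic.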

Stub coordinator emulates the eval pipeline (uniform priors, value=0). With K=1, vl=0, no noise, no tree reuse, the leaf-parallel search behaves like sequential MCTS and we verify the configured iteration count is run. Manual tokio runtime via Builder rather than #[tokio::test] because the workspace sets panic = "abort" in [profile.dev], which interacts badly with the tokio::test macro. The mcts instance holds an internal Sender clone, so it must be dropped before tx so the stub's rx.recv() returns None and the stub task exits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Both batch and continuous modes now use the new pipelined eval coordinator, spawn one async game task per concurrent game, and rely on LeafParallelMCTS for per-game search. Old BatchingEvaluator/coordinator scaffolding is no longer referenced; the legacy --use-raw-onnx-agent path is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The new pipelined coordinator in eval_pipeline.rs replaces it entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove references to the now-deleted BatchingEvaluator in documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- selfplay.rs: Updated doc comment to reflect tokio + leaf-parallel MCTS architecture
- self_play.py: Renamed games_per_thread to games_per_process
- tune_selfplay.py: Renamed all occurrences of games_per_thread to games_per_process across function params, dict keys, CLI help text, and override strings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

No behavioral change; rustfmt formatting only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Bump Cargo.toml edition from 2021 to 2024.
- Escape the now-reserved `gen` keyword: `rng.gen()` → `rng.r#gen()`.
- Fix three rand callsites in minimax.rs, selfplay_game.rs, agent.rs.
- Tighten two `apply_dirichlet_noise` / `expand_node` filter patterns (`|(_, &p)|` → `|&(_, &p)|`): edition 2024 forbids explicit deref inside an implicitly-borrowing pattern.
- Allow `unsafe_op_in_unsafe_fn` crate-wide in lib.rs to suppress the edition-2024 warnings emitted by pyo3 0.22's macro-generated code (121 warnings before, 0 after).

cargo check, cargo build --release, and cargo test --features binary all run clean (153 tests pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bump opset_version to 18 (matches what the runtime auto-converts to) and pass dynamo=False to keep the legacy exporter that supports dynamic_axes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

It fails sometimes due to randomness.

Self-play is unusably slow for 9x9/10-wall. Profiling showed the resnet runs entirely on CPU (GPU at 7-9%, no selfplay compute process) because load_session registers no execution provider and ort has no cuda feature. Spec captures the root cause, the CUDA-EP-with-CPU- fallback fix, and a Blackwell sm_120 de-risking spike to run first.

Step-by-step plan to register the ONNX Runtime CUDA execution provider in load_session behind a gpu Cargo feature, with a CPU fallback, an sm_120 go/no-go benchmark gate (with a load-dynamic contingency), a CUDA-vs-CPU numerics test, and a results writeup.

Self-play resnet inference ran entirely on CPU because load_session registered no execution provider. Add a gpu Cargo feature (ort/cuda) and, behind it, register the CUDA execution provider with a CPU fallback plus Level3 graph optimization. CPU-only builds are unaffected.

The bundled ORT CUDA provider misses libcudnn.so.9 at runtime (not in system lib path). Switching to ort/load-dynamic lets us point at the onnxruntime-gpu 1.26.0 wheel installed in the venv, which properly activates CUDAExecutionProvider on the RTX 5080 (sm_120/Blackwell). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Captures the full before/after benchmark: CPU baseline ~7,600 evals/sec vs GPU steady-state ~38,000 evals/sec (~5× speedup) on the RTX 5080. Documents Path B (load-dynamic + onnxruntime-gpu 1.26.0) required because bundled ORT CUDA 12 libs are mismatched against the host CUDA 13 runtime. Notes required env vars for GPU inference and flags train_v2.py subprocess env inheritance as a follow-up. Identifies CPU-side MCTS pathfinding as the next bottleneck (GPU at 20-31% util, batch_wait_ms=0). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

A gpu-feature selfplay binary uses ort/load-dynamic and panics or silently falls back to CPU when ORT_DYLIB_PATH is absent. Bail with a clear message at session-load time so the operator knows exactly what to set instead of hunting through ORT internals.

train_v2.py spawned selfplay without an explicit env, so the GPU path only worked if ORT_DYLIB_PATH/LD_LIBRARY_PATH were exported by hand. Discover them from the installed onnxruntime-gpu and nvidia wheels and inject them into the subprocess, and pin onnxruntime-gpu so a fresh environment can reproduce the GPU build.

CI resolved ort 2.0.0-rc.12 (Cargo.lock is gitignored, and the requirement permitted any rc) while local builds used rc.11, so the build broke only in CI. rc.12 made two breaking changes to the SessionBuilder chain that load_session now hits:

- with_optimization_level / with_execution_providers return Result<Self, Error<SessionBuilder>>; that error carries the builder back and is not Send/Sync, so anyhow's .context() no longer applies. Convert through the error's Display instead.
- commit_from_file now takes &mut self, so the final builder must be a mutable binding.

Pin ort to =2.0.0-rc.12 so CI and local checkouts resolve the same pre-release and stop diverging on its frequent breaking changes.

jonbinney · 2026-05-27T18:59:43Z

Hmmm... not sure why that test is hanging in CI. Debugging now.

The real-model selfplay parity test hangs intermittently (only under CI/contention; not reproducible in isolation across 130+ local runs). run_python used Command::output() with no timeout, so any stall blocked the test indefinitely (10+ min locally; killed at 60s on CI). Pin OMP/MKL/OPENBLAS threads to 1 for the python child (avoids torch CPU thread-pool oversubscription across the concurrent parity tests, the suspected trigger; verified parity-safe for these small models), and add a 45s per-subprocess timeout that on expiry sends SIGABRT with PYTHONFAULTHANDLER=1 to dump the child's all-thread traceback, then fails with it -- turning an indefinite hang into a fast, self-diagnosing failure that captures the stack from wherever it actually reproduces.
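The timeout-then-SIGABRT idea can be shown in Python itself. The PR's `run_python` helper is Rust, so the sketch below is only an analogy with a hypothetical function name; it assumes a POSIX system and a Python child process, since the traceback dump relies on `PYTHONFAULTHANDLER=1` installing a handler for SIGABRT.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(args: list, timeout_s: float):
    """Run a child process; on timeout, abort it so faulthandler dumps its stacks.

    Returns (returncode, stdout, stderr, timed_out).
    """
    env = dict(
        os.environ,
        PYTHONFAULTHANDLER="1",     # dump all-thread tracebacks on SIGABRT
        OMP_NUM_THREADS="1",        # avoid thread-pool oversubscription
        MKL_NUM_THREADS="1",
        OPENBLAS_NUM_THREADS="1",
    )
    proc = subprocess.Popen(
        args, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGABRT)   # triggers the faulthandler dump
        out, err = proc.communicate()      # collect the traceback from stderr
        return proc.returncode, out, err, True
```

A caller would fail the test when `timed_out` is true and include `stderr` in the failure message, so the hang reports where the child was stuck instead of blocking indefinitely.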

The CI log (PR #370 run 26534399531) shows test_real_model_selfplay_trace_and_npz_matches_python ran for 6h until GitHub's default job timeout cancelled it. My prior 45s timeout in run_python did not fire, which proves the hang isn't in the python subprocess at all -- it's in the rust side of that test (generate_rust_real_model_trace_and_write_npz, which loads an ONNX model via AlphaZeroAgent::new and runs MCTS with ORT CPU inference). Other python_consistency tests that only use run_python all passed. None of the three ORT Session::builder() call sites limited threads. ORT defaults to all CPU cores for intra-op parallelism; parallel CPU tests each loading a session oversubscribe its threadpool and intermittently deadlock -- the same OpenMP-pool deadlock pattern we already hit with torch, just for ORT. Pin intra-op threads to 1 everywhere: parity-safe (the real-model parity test still passes with exact hex match), safe for production self-play (GPU does the math; the eval pipeline gets throughput from leaf-parallel batching, not intra-op threading).

Reproduced the CI hang locally: with --all-features (which enables gpu = ort/cuda + ort/load-dynamic) and no ORT_DYLIB_PATH set, AlphaZeroAgent::new -> OnnxEvaluator::new -> Session::builder() deadlocks on a futex inside ort's dynamic-loader path, before reaching with_intra_threads / commit_from_file. Earlier "evidence" for a threadpool deadlock was wrong; the real cause is gpu/load-dynamic without a dylib path.

Per-thread wchan of the hung test process:

tid 3214454 state=S name=quoridor_rs-a83 wchan=futex_do_wait
tid 3214455 state=S name=python_consiste wchan=futex_do_wait

Two fixes:

- Change the CI test step from `cargo test --all-features` to `cargo test --features binary`. CI has no GPU; enabling load-dynamic + cuda is pointless and is the source of the deadlock. With --features binary the test binary uses statically-linked CPU ORT and the previously-hanging test passes in ~7s.
- Add the same ORT_DYLIB_PATH guard that eval_pipeline::load_session already has to OnnxAgent::new and OnnxEvaluator::new, so any future gpu build without ORT_DYLIB_PATH fails immediately with a clear error ("gpu feature is enabled but ORT_DYLIB_PATH is not set...") instead of silently deadlocking.

jonbinney and others added 30 commits May 18, 2026 14:46

Log total moves played to W&B in trainer

18da01f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace SmallVec depth_est placeholder with concrete capacity

5a3fda4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add tokio, futures, smallvec deps for selfplay refactor

c72ba95

Update experiment YAMLs for new self_play schema

a6e7184

Store prior_clean on each MCTS node for tree-reuse re-noising

c103936

Apply Dirichlet noise after root expansion using arena-based fn

56d4476

Add virtual-loss apply/undo helpers for leaf-parallel MCTS

fbee601

Add select_leaf_with_vl: PUCT descent that applies virtual loss

f32775d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add promote_subtree for tree reuse across moves

0d4eff7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Test post-process stage: rayon finalize matches serial reference

10642d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

selfplay_mcts: drop vestigial to_send index, comment terminal polarity

25da9d7

Test leaf-parallel diversification (K=8 vl=3) and tree reuse

a47ac56

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LeafParallelMCTS: clear tree on model version change

f7027c2

Add play_game_async: tokio-based self-play game runner

703be4a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove legacy eval_coordinator.rs and BatchingEvaluator

4304bb6

Update comments in AlphaZeroAgent to reference new eval_pipeline

f53261f

Add --profile-counters: periodic GPU/batcher/postprocess timings

ab98817

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jonbinney and others added 7 commits May 20, 2026 18:16

Add bench_rust_selfplay.sh: fixed-duration games/sec measurement

421d6f2

cargo fmt after selfplay perf refactor

65d45bf

Remove verbose timing output

a5396d9

Silence ONNX export warnings on torch 2.12

9645be7

Relax MLP convergence tes

b748000

Formatting

85aa785

jonbinney requested a review from alejandromarcu May 23, 2026 18:26

jonbinney marked this pull request as ready for review May 23, 2026 18:26

jonbinney and others added 10 commits May 23, 2026 16:09

Create experiment for b9w10

1e45b6a

vibe: test CUDA inference matches CPU output

0a40da6

alejandromarcu approved these changes May 26, 2026

View reviewed changes

jonbinney added 3 commits May 27, 2026 15:38

jonbinney merged commit e59d236 into main May 28, 2026
5 checks passed

jonbinney merged 50 commits into main from jdb/rust-tokio-selfplay
