
Parallel self-play using Tokio #370

Merged
jonbinney merged 50 commits into main from jdb/rust-tokio-selfplay on May 28, 2026

@jonbinney jonbinney commented May 23, 2026

Owner

Summary of changes

  • Multi-threaded self-play in Rust, using the Tokio asynchronous runtime.
  • An evaluation cache in Rust self-play, shared across all worker threads and all games.
  • "Virtual loss", so that MCTS search can run multiple times on a single tree without waiting for each evaluation.
  • After choosing a move, the Rust AlphaZero agent now reuses that move's MCTS subtree to save work on the next move.
  • Renamed the self-play config variables that control parallelism so that they make sense for both the Rust and the Python multithread/multiprocess designs.
  • Updated the Rust package to use the 2024 edition of Rust instead of the 2021 edition.
  • Included the design and plan docs that the Claude superpowers plugin created. @adamantivm will like this :-)
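To make the subtree-reuse item concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the Rust code in this PR: the `Node` class, its field names, and `promote_child` are hypothetical, and the move labels are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical MCTS node; the field names are illustrative only."""
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # move -> Node


def promote_child(root: Node, move) -> Node:
    """Return the subtree under `move` as the next root.

    If the move was never expanded, fall back to a fresh, empty node.
    """
    child = root.children.get(move)
    return child if child is not None else Node()


# Build a tiny tree: the root has two explored moves.
root = Node(visit_count=10)
root.children["a"] = Node(visit_count=7, value_sum=3.5)
root.children["b"] = Node(visit_count=3, value_sum=-1.0)

# After playing move "a", the next search starts with 7 visits already done.
new_root = promote_child(root, "a")
print(new_root.visit_count)
```

The statistics already gathered under the chosen move carry over, so the next search needs fewer new evaluations to reach the same visit count.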

Instructions for trying this out

  • Download this config file
  • Create a python virtual environment, activate it, and run pip install -r deep_quoridor/requirements.txt
  • In the deep_quoridor/rust directory, run cargo build --release --all-features --bin selfplay
  • From the root directory of the repo, run python -O deep_quoridor/src/train_v2.py <path to downloaded config.yaml> --overrides run_id=<unique name for the run>

Multi-threaded rust self-play with Tokio

A fixed number of worker threads is now created, and any of them can work on the MCTS tree of whichever game is ready. This lets you choose the number of threads based on the number of CPU cores, and choose the number of parallel games to take advantage of the efficiency of batched evaluations. Interestingly, Claude implemented this in a way that picks up a new model version immediately, even mid-game.

Virtual loss

We want to be able to run several MCTS searches on a tree without waiting for the evaluator each time. The problem is that every search would choose the same path through the tree, since the total values and visit counts will not have changed. Virtual loss fixes this by having each search temporarily backpropagate a fixed number of losses (the default is 3, which AlphaZero used). This makes that path through the tree less attractive to the next search. Once the evaluations are ready, the virtual losses are removed and the actual evaluations are backpropagated.
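As a rough illustration of the bookkeeping, here is a minimal Python sketch of applying and then reverting a virtual loss along one search path. It is not the Rust implementation from this PR. The `Node` fields and both function names are hypothetical, and the sketch assumes each node's `value_sum` is stored from the perspective of the player who chose to enter it, with the value sign flipping at every ply.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical MCTS node statistics."""
    visit_count: int = 0
    value_sum: float = 0.0


def apply_virtual_loss(path, virtual_loss=3):
    """Pretend the path was visited `virtual_loss` times and lost each time."""
    for node in path:
        node.visit_count += virtual_loss
        node.value_sum -= virtual_loss


def revert_and_backprop(path, value, virtual_loss=3):
    """Undo the virtual loss, then record one real visit with the real value."""
    for node in reversed(path):
        node.visit_count += 1 - virtual_loss
        node.value_sum += virtual_loss + value
        value = -value  # assumption: perspective alternates every ply


root, leaf = Node(), Node()
path = [root, leaf]

apply_virtual_loss(path)          # path now looks like three straight losses
pending = (leaf.visit_count, leaf.value_sum)

revert_and_backprop(path, 0.5)    # the evaluation came back with value 0.5
print(pending, leaf, root)
```

While the evaluation is pending, the path's mean value is -1, which steers the next selection elsewhere. After the revert, only the single real visit remains in the statistics.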

If the number of searches done before evaluations are ready (call this K) is too large, then the overall statistics of MCTS will be thrown off. Anecdotally, using K=16 for mcts_n=400 seems to cause no noticeable decrease in performance.

If we want to, we could try techniques that are newer or better than virtual loss, such as "virtual mean".

Self-play config variables renamed/added

  • num_workers is now num_processes, to be more explicit. I have always used a single Rust self-play process, to take advantage of the shared evaluator cache.
  • parallel_games is now games_per_process, to make it clearer. In Python, each self-play process has exactly one thread that handles all of those games. In Rust there can be more than one thread.
  • mcts_worker_threads is only used in Rust at the moment. This should be set as high as the CPU can handle. Each of these threads can work on any of the games.
  • leaf_parallelism is the number of searches that can be done on a single MCTS tree before waiting for evaluations from the GPU. (Virtual loss values are used until the evaluations are ready.)
  • virtual_loss is the number of losses temporarily backpropagated for each node waiting on an evaluation.
  • enable_tree_reuse: if true, the MCTS subtree of the chosen move is reused for the next move.
  • eval_cache_max_size limits the evaluator cache size. One million is a pretty safe value; set it as high as the machine's memory allows.
  • eval_batch_size: the evaluator batches up evaluation requests so that it can send them to the GPU together. This is the maximum number of evaluations it can send to the GPU at once. Setting it larger increases evaluation throughput, but also increases latency. If the latency is too high, MCTS search threads running on the CPU will be blocked while waiting for evaluations.
  • eval_max_wait_ms: we don't want to wait forever for a full-size batch. After this many milliseconds, we just run whatever batch is ready to go.
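The interaction between eval_batch_size and eval_max_wait_ms can be sketched as follows. This is a stdlib-only Python illustration, not the Rust eval pipeline: `collect_batch` is a hypothetical helper, and it assumes the wait budget is measured from the arrival of the first request.

```python
import queue
import time


def collect_batch(requests, eval_batch_size, eval_max_wait_ms):
    """Block for the first request, then fill the batch until it is full
    or the wait budget (measured from the first request) is used up."""
    batch = [requests.get()]
    deadline = time.monotonic() + eval_max_wait_ms / 1000.0
    while len(batch) < eval_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted: run whatever we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


pending = queue.Queue()
for i in range(5):
    pending.put(i)

full = collect_batch(pending, eval_batch_size=3, eval_max_wait_ms=50)
partial = collect_batch(pending, eval_batch_size=8, eval_max_wait_ms=10)
print(full, partial)
```

In this sketch a wait of 0 ms always yields a batch of exactly one request, which is the latency-minimizing extreme of the trade-off described above.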

jonbinney and others added 30 commits May 18, 2026 14:46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run many self-play games concurrently in one Rust process by routing
all NN evaluations through a single eval coordinator thread that batches
requests and serves a shared DashMap eval cache keyed on CompactState.

- New `self_play.rust` config block + matching `--num-threads`,
  `--games-per-thread`, `--eval-batch-size`, `--eval-max-wait-ms`,
  `--eval-cache-max-size` CLI flags. Defaults reproduce sequential
  behavior (1 worker, batch-of-1, no cache).
- `eval_coordinator.rs`: owns the ORT session; collects up to
  eval_batch_size requests bounded by eval_max_wait_ms from the first
  request, runs one batched inference, inserts into the cache up to
  the configured cap (no eviction), and routes results back via
  oneshot channels. Handles ReloadModel (drop session, clear cache,
  load new) and Shutdown control messages.
- BatchingEvaluator (new Evaluator impl): looks up the shared cache,
  builds the rotated input tensor in the requesting worker thread on
  miss, sends an EvalRequest, blocks on the response.
- AlphaZeroAgent now holds Box<dyn Evaluator + Send> via a new
  with_evaluator constructor so workers can plug in BatchingEvaluator
  while the legacy path keeps using OnnxEvaluator.
- mcts::search takes &mut dyn Evaluator so the boxed agent works.
- --use-raw-onnx-agent keeps the legacy single-threaded path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
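The "insert up to the configured cap (no eviction)" cache policy from the commit above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a plain dict stands in for the shared DashMap, `cache_insert` is a hypothetical name, the keys are made-up stand-ins for CompactState, and thread safety is ignored.

```python
def cache_insert(cache, key, value, max_size):
    """Store the evaluation only while the cache is under its cap.

    Existing entries are never evicted; once the cache is full, new
    results are simply not cached. Returns True if the value was stored.
    """
    if key in cache or len(cache) < max_size:
        cache[key] = value
        return True
    return False


cache = {}
stored = [cache_insert(cache, state, 0.0, max_size=2) for state in ("s1", "s2", "s3")]
print(stored, sorted(cache))
```

With a policy like this, a cap of 0 disables caching entirely, since nothing is ever stored.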
parallel_games (Python) and games_per_thread (Rust) named the same
concept: concurrent games per self-play process. Unify on the
games_per_thread name and hoist the Rust-only eval-coordinator knobs
(num_threads, eval_batch_size, eval_max_wait_ms, eval_cache_max_size)
up to the top of self_play so both sides parse one flat block.

Python (train_v2):
- v2/config.py: SelfPlayConfig renamed parallel_games -> games_per_thread;
  added num_threads/eval_batch_size/eval_max_wait_ms/eval_cache_max_size
  with defaults that preserve current Python behavior.
- v2/self_play.py, tune_selfplay.py, test/config_test.py: renamed.

Rust:
- selfplay_config.rs: dropped the self_play.rust sub-section; moved its
  fields up to SelfPlayWorkerConfig alongside the renamed
  games_per_thread.
- bin/selfplay.rs: ResolvedRustConfig now reads SelfPlayWorkerConfig
  directly.

YAMLs: parallel_games -> games_per_thread in experiments/*.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ds_per_process

Makes the YAML field names match what they actually control: num_processes
is the count of self-play subprocesses, threads_per_process is the per-process
worker-thread count inside Rust self-play.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Async tokio worker model, leaf-parallel MCTS with virtual loss, pipelined
eval coordinator, tree reuse across moves. Targets ≥4x games/sec on the
reference hardware (RTX 5080 + 12c/24t) by saturating both CPU and GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
24 tasks covering config schema changes, MCTS extensions (virtual loss,
tree reuse, in-arena Dirichlet noise), async pipelined eval coordinator,
LeafParallelMCTS, async game runner, selfplay binary rewiring, profile
counters, benchmark script, and manual perf verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops threads_per_process/games_per_thread; adds games_per_process,
leaf_parallelism, virtual_loss, enable_tree_reuse, mcts_worker_threads.
Raises eval_cache_max_size default from 0 to 100000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror Python schema changes: drops threads_per_process / games_per_thread,
adds games_per_process, leaf_parallelism, virtual_loss, enable_tree_reuse,
mcts_worker_threads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lays groundwork for tree reuse: prior_clean stays untouched, and noise
can be re-mixed into the new root's children after subtree promotion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
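The point of keeping prior_clean untouched is that fresh noise can be mixed in whenever a node becomes the root. A minimal Python sketch of that idea follows. The function names and the alpha and epsilon values are illustrative assumptions, not values taken from this PR.

```python
import random


def sample_dirichlet(alpha, n, rng):
    """Draw one symmetric Dirichlet(alpha) sample of length n via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_root_noise(prior_clean, alpha, epsilon, rng):
    """Return noisy priors computed from the clean ones.

    prior_clean is left unmodified, so the mix can be redone later with
    new noise, for example after a subtree is promoted to the root.
    """
    noise = sample_dirichlet(alpha, len(prior_clean), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(prior_clean, noise)]


rng = random.Random(0)
prior_clean = [0.5, 0.3, 0.2]

first = mix_root_noise(prior_clean, alpha=0.3, epsilon=0.25, rng=rng)
second = mix_root_noise(prior_clean, alpha=0.3, epsilon=0.25, rng=rng)
print(first, second, prior_clean)
```

Because each mix starts from the clean priors, noise from an earlier move never accumulates in a reused subtree.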
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three dedicated OS threads connected by capacity-1 sync_channels, fronted
by a tokio mpsc. finalize_policy is parallelized via rayon in the
post-process stage. Old eval_coordinator.rs remains until cleanup task.

Also fix pre-existing selfplay.rs compile error: SelfPlayWorkerConfig
no longer has threads_per_process/games_per_thread after the Task 3
rework; map games_per_thread YAML fallback to games_per_process.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-wait

- Add one-time cache-saturation warning when cache.len() >= cache_max
- Add debug_assert to verify batch_len > 0 before division
- Document zero-wait (max_wait=0) semantics forcing batch-size-1
- Annotate PostIn::Reload handling with context comment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owns its arena across calls so subsequent task can enable tree reuse.
Each outer iteration selects up to K leaves with virtual loss, classifies
them as Terminal/Hit/Miss, sends miss requests to the eval pipeline, then
awaits in selection order while undoing VL + backpropping real results.
Stub coordinator emulates the eval pipeline (uniform priors, value=0).
With K=1, vl=0, no noise, no tree reuse, the leaf-parallel search behaves
like sequential MCTS and we verify the configured iteration count is run.

Manual tokio runtime via Builder rather than #[tokio::test] because the
workspace sets panic = "abort" in [profile.dev], which interacts badly
with the tokio::test macro.

The mcts instance holds an internal Sender clone, so it must be dropped
before tx so the stub's rx.recv() returns None and the stub task exits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both batch and continuous modes now use the new pipelined eval coordinator,
spawn one async game task per concurrent game, and rely on LeafParallelMCTS
for per-game search. Old BatchingEvaluator/coordinator scaffolding is no
longer referenced; the legacy --use-raw-onnx-agent path is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new pipelined coordinator in eval_pipeline.rs replaces it entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove references to the now-deleted BatchingEvaluator in documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- selfplay.rs: Updated doc comment to reflect tokio + leaf-parallel MCTS architecture
- self_play.py: Renamed games_per_thread to games_per_process
- tune_selfplay.py: Renamed all occurrences of games_per_thread to games_per_process across function params, dict keys, CLI help text, and override strings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jonbinney and others added 7 commits May 20, 2026 18:16
No behavioral change; rustfmt formatting only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Bump Cargo.toml edition from 2021 to 2024.
- Escape the now-reserved `gen` keyword: `rng.gen()` → `rng.r#gen()`.
- Fix three rand callsites in minimax.rs, selfplay_game.rs, agent.rs.
- Tighten two `apply_dirichlet_noise` / `expand_node` filter patterns
  (`|(_, &p)|` → `|&(_, &p)|`) — edition 2024 forbids explicit deref
  inside an implicitly-borrowing pattern.
- Allow `unsafe_op_in_unsafe_fn` crate-wide in lib.rs to suppress the
  edition-2024 warnings emitted by pyo3 0.22's macro-generated code
  (121 warnings before, 0 after).

cargo check, cargo build --release, and cargo test --features binary
all run clean (153 tests pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump opset_version to 18 (matches what the runtime auto-converts to)
and pass dynamo=False to keep the legacy exporter that supports
dynamic_axes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It fails sometimes due to randomness.
@jonbinney jonbinney requested a review from alejandromarcu May 23, 2026 18:26
@jonbinney jonbinney marked this pull request as ready for review May 23, 2026 18:26
jonbinney and others added 10 commits May 23, 2026 16:09
Self-play is unusably slow for 9x9/10-wall. Profiling showed the
resnet runs entirely on CPU (GPU at 7-9%, no selfplay compute process)
because load_session registers no execution provider and ort has no
cuda feature. Spec captures the root cause, the CUDA-EP-with-CPU-
fallback fix, and a Blackwell sm_120 de-risking spike to run first.
Step-by-step plan to register the ONNX Runtime CUDA execution
provider in load_session behind a gpu Cargo feature, with a CPU
fallback, an sm_120 go/no-go benchmark gate (with a load-dynamic
contingency), a CUDA-vs-CPU numerics test, and a results writeup.
Self-play resnet inference ran entirely on CPU because load_session
registered no execution provider. Add a gpu Cargo feature (ort/cuda)
and, behind it, register the CUDA execution provider with a CPU
fallback plus Level3 graph optimization. CPU-only builds are
unaffected.
The bundled ORT CUDA provider misses libcudnn.so.9 at runtime (not in
system lib path). Switching to ort/load-dynamic lets us point at the
onnxruntime-gpu 1.26.0 wheel installed in the venv, which properly
activates CUDAExecutionProvider on the RTX 5080 (sm_120/Blackwell).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures the full before/after benchmark: CPU baseline ~7,600
evals/sec vs GPU steady-state ~38,000 evals/sec (~5× speedup) on
the RTX 5080. Documents Path B (load-dynamic + onnxruntime-gpu
1.26.0) required because bundled ORT CUDA 12 libs are mismatched
against the host CUDA 13 runtime. Notes required env vars for GPU
inference and flags train_v2.py subprocess env inheritance as a
follow-up. Identifies CPU-side MCTS pathfinding as the next
bottleneck (GPU at 20-31% util, batch_wait_ms=0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A gpu-feature selfplay binary uses ort/load-dynamic and panics or
silently falls back to CPU when ORT_DYLIB_PATH is absent. Bail with a
clear message at session-load time so the operator knows exactly what
to set instead of hunting through ORT internals.
train_v2.py spawned selfplay without an explicit env, so the GPU path
only worked if ORT_DYLIB_PATH/LD_LIBRARY_PATH were exported by hand.
Discover them from the installed onnxruntime-gpu and nvidia wheels and
inject them into the subprocess, and pin onnxruntime-gpu so a fresh
environment can reproduce the GPU build.
CI resolved ort 2.0.0-rc.12 (Cargo.lock is gitignored, and the
requirement permitted any rc) while local builds used rc.11, so the
build broke only in CI. rc.12 made two breaking changes to the
SessionBuilder chain that load_session now hits:

- with_optimization_level / with_execution_providers return
  Result<Self, Error<SessionBuilder>>; that error carries the builder
  back and is not Send/Sync, so anyhow's .context() no longer applies.
  Convert through the error's Display instead.
- commit_from_file now takes &mut self, so the final builder must be a
  mutable binding.

Pin ort to =2.0.0-rc.12 so CI and local checkouts resolve the same
pre-release and stop diverging on its frequent breaking changes.
@jonbinney

Owner Author

Hmmm... not sure why that test is hanging in CI. Debugging now.

jonbinney added 3 commits May 27, 2026 15:38
The real-model selfplay parity test hangs intermittently (only under
CI/contention; not reproducible in isolation across 130+ local runs).
run_python used Command::output() with no timeout, so any stall blocked
the test indefinitely (10+ min locally; killed at 60s on CI).

Pin OMP/MKL/OPENBLAS threads to 1 for the python child (avoids torch CPU
thread-pool oversubscription across the concurrent parity tests, the
suspected trigger; verified parity-safe for these small models), and add
a 45s per-subprocess timeout that on expiry sends SIGABRT with
PYTHONFAULTHANDLER=1 to dump the child's all-thread traceback, then fails
with it -- turning an indefinite hang into a fast, self-diagnosing
failure that captures the stack from wherever it actually reproduces.
The CI log (PR #370 run 26534399531) shows
test_real_model_selfplay_trace_and_npz_matches_python ran for 6h until
GitHub's default job timeout cancelled it. My prior 45s timeout in
run_python did not fire, which proves the hang isn't in the python
subprocess at all -- it's in the rust side of that test
(generate_rust_real_model_trace_and_write_npz, which loads an ONNX
model via AlphaZeroAgent::new and runs MCTS with ORT CPU inference).
Other python_consistency tests that only use run_python all passed.

None of the three ORT Session::builder() call sites limited threads.
ORT defaults to all CPU cores for intra-op parallelism; parallel CPU
tests each loading a session oversubscribe its threadpool and
intermittently deadlock -- the same OpenMP-pool deadlock pattern we
already hit with torch, just for ORT. Pin intra-op threads to 1
everywhere: parity-safe (the real-model parity test still passes with
exact hex match), safe for production self-play (GPU does the math; the
eval pipeline gets throughput from leaf-parallel batching, not intra-op
threading).
Reproduced the CI hang locally: with --all-features (which enables
gpu = ort/cuda + ort/load-dynamic) and no ORT_DYLIB_PATH set,
AlphaZeroAgent::new -> OnnxEvaluator::new -> Session::builder()
deadlocks on a futex inside ort's dynamic-loader path, before reaching
with_intra_threads / commit_from_file. Earlier "evidence" for a
threadpool deadlock was wrong; the real cause is gpu/load-dynamic
without a dylib path. Per-thread wchan of the hung test process:

  tid 3214454 state=S name=quoridor_rs-a83 wchan=futex_do_wait
  tid 3214455 state=S name=python_consiste wchan=futex_do_wait

Two fixes:

- Change the CI test step from `cargo test --all-features` to
  `cargo test --features binary`. CI has no GPU; enabling load-dynamic
  + cuda is pointless and is the source of the deadlock. With
  --features binary the test binary uses statically-linked CPU ORT and
  the previously-hanging test passes in ~7s.

- Add the same ORT_DYLIB_PATH guard that eval_pipeline::load_session
  already has to OnnxAgent::new and OnnxEvaluator::new, so any future
  gpu build without ORT_DYLIB_PATH fails immediately with a clear error
  ("gpu feature is enabled but ORT_DYLIB_PATH is not set...") instead
  of silently deadlocking.
@jonbinney jonbinney merged commit e59d236 into main May 28, 2026
5 checks passed

2 participants