Could the tree sequence search run entirely on the GPU and its VRAM, like a video game?
- Fingerprint compatibility check (
TreeFingerprint::compatible) — 17-byte struct comparison across all N candidates simultaneously. Pure data-parallel, no branching. Ideal GPU workload. - Rejection bitset scan (
find_first_live) — reading NAtomicBools and finding the first unset one is a parallel reduction. GPUs do this fast. - Post-acceptance sweep (
CandidatePool::sweep) — currently uses rayon (CPU threads); the same pattern maps naturally to a GPU compute dispatch. Each candidate is independent.
The two hot flat arrays (fingerprints, rejected) are already contiguous, fixed-size, and layout-friendly for GPU memory.
- Recursive backtracking (
embeds/match_children) — irregular, variable-depth recursion with dynamic branching per tree pair. GPUs execute in lockstep warps; divergent control flow causes most threads to stall. This is the worst possible workload for a GPU. - Dynamic heap allocation —
Treeclones useVec<Node>(heap-allocated, variable size). GPU memory has no malloc equivalent per thread. - Tree generation phase (
gen_combos_cached) — deeply recursive partition enumeration with a shared mutable memoization cache. Fundamentally sequential; no GPU mapping.
Keep CPU responsible for:
- Tree generation and canonicalization
- The recursive
embedscheck (backtracking) - The memoization cache
Offload to GPU:
- Fingerprint pre-filter sweep over all N candidates (GPU compute shader)
- Rejection bitset compaction / scan
Libraries that could enable this in Rust:
wgpu— cross-platform GPU compute (WebGPU API); no CUDA requiredcudarc— CUDA kernels from Rust; NVIDIA onlyopencl3— OpenCL; cross-vendor
Running the embedding check on GPU would require:
- Iterative stack-based embedding — rewrite
match_childrenas an explicit stack (no recursion; GPU shaders have no call stack). - Flat tree encoding — trees are already
Vec<Node>with index-based children; this is the right format. Would need to pack into a GPU buffer as a struct-of-arrays. - GPU-side work queue — a persistent kernel with a device-side queue dispatching embedding checks as candidates are found live.
- Warp divergence mitigation — group candidates by tree size/shape before dispatching to reduce branch divergence within a warp.
This is feasible as a research project but is a substantial rewrite. The embedding checker would need to be redesigned from scratch for SIMT execution.
| Component | GPU feasible? | Notes |
|---|---|---|
| Fingerprint sweep | Yes, straightforward | Flat array, 17-byte compare, no branching |
| Rejection bitset scan | Yes | Parallel reduction |
embeds recursive check |
No without major rewrite | Divergent recursion, dynamic allocation |
| Tree generation | No | Sequential recursive enumeration |
| Full GPU execution | Possible (research) | Requires iterative embedding + flat encoding |