Potential future optimizations and features, roughly ordered by expected impact.
Implemented: CandidatePool with AtomicBool rejected bitset + post-acceptance
parallel sweep. Scan is now O(N) atomic loads; all embedding work is amortized over
sweeps. Flat arrays (fingerprints, rejected) pinned in physical RAM via
mlock/VirtualLock. See CHANGELOG pass 4.
Expected impact: Massive (explores 7× more trees, finds much longer sequences)
Memory budget for a 32 GB machine:
| max-nodes | Candidate trees | Heap (trees) | Fingerprints | Rejected bitset | Total |
|---|---|---|---|---|---|
| 9 | 3.5 M | ~635 MB | ~57 MB | ~3.4 MB | ~696 MB |
| 10 | 24.5 M | ~4.4 GB | ~398 MB | ~23.4 MB | ~4.8 GB |
| 11 | 171 M | ~30 GB | ~2.8 GB | ~163 MB | ~33 GB |
Running --max-nodes 10 is fully within budget (~5 GB). Running --max-nodes 11
requires ~33 GB and is borderline. The flat arrays (fingerprints, rejected) are
already locked in physical RAM via mlock/VirtualLock (pass 4); just raise the
--max-nodes flag. The tree heap (~4.4 GB) is not locked and may page under load —
acceptable since it is accessed sequentially during sweeps.
Expected impact: High | Complexity: High
gen_combos_cached (which enumerates child-subtree partitions) is still
sequential due to its recursive &mut cache borrow. Once the cache for
sizes 1..n-1 is fully populated, the combo enumeration could be parallelized
per first-child size: each starting size sz (1..remaining) is independent.
This would require restructuring the function to take &cache (immutable) and
accumulate into a parallel fold. The combo enumeration accounts for the remaining
single-threaded work in the pre-warm phase.
Expected impact: Small-Medium | Complexity: Medium
The label multiset check in embedding.rs (label_multiset_fits) and
fingerprint.rs (compatible) iterate over a fixed 8-element array.
With std::simd (stabilised in Rust nightly), this could be a single 8-lane
u8 comparison and subtraction, reducing the label check to ~2 instructions.
Would benefit most at very high candidate counts (tens of millions).
Expected impact: Medium | Complexity: Low
In the inner loop sequence.iter().any(|entry| fingerprint + embeds), accepted
trees that reject the most candidates should be checked first. Track per-entry
rejection counts and re-sort the accepted sequence by descending rejection rate
after each position. The hottest rejectors move to the front and short-circuit
the any() earlier.
Expected impact: Medium | Complexity: Medium
Trees with ≤ 10 nodes and ≤ 3 labels can be encoded in ~32 bytes as a flat
struct (labels: [u8; 10], n_children: [u8; 10], children: [[u8; 9]; 10],
subtree_sizes: [u8; 10]). Eliminating Vec allocations per node would:
- Improve cache locality during the embedding check
- Make the candidate list a flat
Vec<CompactTree>with no pointer chasing - Allow stack-only embedding checks
Expected impact: Low-Medium depending on sequence length | Complexity: Low
sequence.par_iter().any() for the inner loop was tested and showed no gain
when the outer candidates.par_iter().find_first() already saturates all threads
(rayon processes nested tasks inline, effectively sequentially). However, for very
long sequences (100+ accepted trees) where the inner loop becomes the bottleneck,
a dedicated thread pool split (e.g., outer uses half the threads, inner uses the
other half) could help. Worth revisiting once sequences exceed ~50 trees.
Expected impact: Medium | Complexity: Medium
Cache embeds(P, C) results in HashMap<(canonical_P, canonical_C), bool>.
Useful if the same (accepted tree, candidate) pair is evaluated multiple times
across positions. Currently each candidate is unique (deduped by canonical form)
and only evaluated once, so the cache hit rate is low for the standard case. Could
become useful with multi-pass strategies or when re-evaluating trees after pruning.
Implemented: --strategy random + --seed N flag. Uses an internal xorshift64
RNG (no external dependency). Omitting --seed produces a time-based seed printed
to stderr for reproducibility. Pool backing order is largest-first; random pick is
applied at scan time via find_random_live().
Expected impact: Potentially high | Complexity: Medium
At each position, among all live candidates, pick the one that would eliminate the fewest remaining candidates if accepted — i.e., the tree whose post-acceptance sweep marks the smallest number of rejections. This "keep options open" heuristic may produce significantly longer sequences than greedy-largest by deferring constraint tightening as long as possible.
Implementation: for each live candidate C, simulate sweep(C) and count new
rejections (without committing); pick the C with the minimum count. Cost is
O(live² × embedding) per position in the worst case, but the fingerprint gate
makes most sweeps very cheap. Can be bounded by evaluating only the top-K
candidates by size to limit cost.
Expected impact: Medium | Complexity: Low (same code as EXPLORE-2)
Mirror of EXPLORE-2: pick the candidate whose sweep eliminates the most remaining candidates. Burns through the pool aggressively early, potentially reaching a natural termination faster but with a longer sequence up to that point (large trees accepted early → fewer conflicts later). Shares almost all implementation with EXPLORE-2; just flip the min/max comparison.
Expected impact: Highest | Complexity: High
Maintain W partial sequences simultaneously. At each step, expand each beam by evaluating its top candidates; keep the W longest-running sequences overall. Properly explores the branching structure of the search space rather than committing greedily to one path.
Implementation requires W independent CandidatePool instances (each with its own
rejected bitset reflecting that beam's acceptance history). Memory cost: W ×
(fingerprint array + rejected bitset) ≈ W × 420 MB at max-nodes=10. With W=4 that
is ~1.7 GB of flat arrays plus W copies of the tree heap.
Suggested approach: restructure generate_sequence into a Beam struct holding
a Vec<(Vec<SequenceEntry>, CandidatePool)>; at each step, each beam picks its
own best candidate, sweeps its own pool, and the beams are ranked by current
length. Beams that terminate early are dropped; new beams can be seeded from the
longest surviving one.
Expected impact: High practical utility | Complexity: Very low
Rewrite partial.json (same format as sequence.json) after every acceptance,
alongside the existing per-tree SVG writes. If the run is interrupted (Ctrl+C,
OOM, timeout), the file on disk reflects the longest sequence found so far.
The existing on_found callback already has everything needed; just add a
fs::write call mirroring the overview SVG rewrite.
Serve the generated SVGs as a single-page HTML file with a grid layout, zoom, and click-to-highlight embedding relationships between trees.
Export trees in DOT format for use with dot, neato, or other Graphviz renderers.
Useful for larger trees where the current SVG layout needs more sophisticated
edge routing.
Given two tree indices, render both side-by-side and visually highlight the homeomorphic embedding mapping (if one exists) with colored arrows.
Generate an animated SVG or GIF showing trees being added to the sequence one by one, with embedding rejection visualizations.
The embedding check is a recursive backtracking algorithm with:
- Divergent branching per tree pair (bad for GPU SIMD warps)
- Dynamic stack allocation during recursion (no heap on GPU)
- Irregular memory access via linked tree structures
- Very small trees (≤ 10 nodes) — GPU dispatch overhead would dominate
The data-parallel parts (label filter, size filter) are already nanosecond-fast on CPU. GPU would add PCIe transfer latency for 3.5M trees with no meaningful computation to amortize it. Not viable for this problem structure.
Tested: sequence.par_iter().any() inside candidates.par_iter().find_first().
No speedup observed. When the outer par_iter saturates all rayon threads, inner
par_iter tasks are processed inline (sequentially) by the current thread.
Rayon's work-stealing cannot create more threads than the pool size.
See [PERF-5] for a viable alternative once sequences are long enough.