Improve parallelization: threading backend, PI batching, auto-selection #91

Merged
monte-flora merged 1 commit into master from improve/performance-optimization
Apr 2, 2026

@monte-flora
Owner

  • Switch run_parallel from the loky backend to the threading backend. This eliminates pickling overhead for feature-level parallelism. scikit-learn's predict releases the GIL, so threads scale well (tested up to 10 cores).
  • Add auto-selection: run serially and skip the parallel overhead when the task count is below 3.
  • Batch the n_permute predict calls in permutation importance into a single predict call (PI 10v/10p: 78.6s → 64.2s serial, 19.1s at n_jobs=4).
  • Optimize PI column reassembly with a numpy in-place column swap instead of column-by-column extraction with pd.concat.
  • Add a stress test to the benchmark suite (10K samples, 30 features, 100 trees).
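
For reviewers, the first two changes can be illustrated with a minimal sketch. This is not the code in this PR: `run_parallel_sketch`, `MIN_TASKS_FOR_PARALLEL`, and `square_sum` are hypothetical names, and the only library call assumed is joblib's `Parallel` with `backend="threading"`.

```python
from joblib import Parallel, delayed

# Assumed threshold, taken from the "task count below 3" rule in the description.
MIN_TASKS_FOR_PARALLEL = 3


def run_parallel_sketch(func, args_list, n_jobs=1):
    """Apply func to each argument tuple, using threads only when worthwhile."""
    args_list = list(args_list)
    # Auto-selection: with too few tasks, the scheduling overhead outweighs
    # any gain, so fall back to a plain serial loop.
    if n_jobs == 1 or len(args_list) < MIN_TASKS_FOR_PARALLEL:
        return [func(*args) for args in args_list]
    # Threads share the parent's memory, so arguments and results are not pickled.
    return Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(func)(*args) for args in args_list
    )


def square_sum(a, b):
    # Hypothetical stand-in for a per-feature computation.
    return a * a + b * b


results = run_parallel_sketch(square_sum, [(1, 2), (3, 4), (5, 6), (7, 8)], n_jobs=2)
print(results)
```

Threads only pay off when the work inside `func` releases the GIL, as the first bullet notes for predict; a pure-Python `func` like the stand-in here would not speed up.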

Scaling results (10K samples, 30 features, 100-tree RF):

  • ALE (30 feat, 1 boot): 1.67s → 0.36s at n_jobs=10 (4.7x)
  • ALE (30 feat, 10 boot): 16.3s → 2.7s at n_jobs=10 (6.1x)
  • PD (5 feat, 10 boot): 29.7s → 7.4s at n_jobs=5 (4.0x)
  • PI (10 vars, 10 perm): 66.4s → 19.1s at n_jobs=4 (3.5x)
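
The PI batching and in-place column swap described in the list above might look roughly like the following. This is a sketch under assumptions, not the implementation in this PR: `batched_permuted_predictions` is a hypothetical helper, and a NumPy linear function stands in for a fitted model's predict.

```python
import numpy as np


def batched_permuted_predictions(predict, X, col, n_permute, rng):
    """Return an (n_permute, n_samples) array of predictions with column `col` shuffled."""
    n_samples = X.shape[0]
    # Stack n_permute copies of X so that every permutation lives in one array.
    X_batch = np.tile(X, (n_permute, 1))
    for p in range(n_permute):
        rows = slice(p * n_samples, (p + 1) * n_samples)
        # In-place column swap: overwrite only the permuted column,
        # leaving every other column untouched.
        X_batch[rows, col] = rng.permutation(X[:, col])
    # One predict call instead of n_permute separate calls.
    return predict(X_batch).reshape(n_permute, n_samples)


# Hypothetical stand-in model: column 1 has zero weight, so permuting it
# must leave the predictions unchanged.
weights = np.array([2.0, 0.0, -1.0])


def predict(X):
    return X @ weights


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
preds = batched_permuted_predictions(predict, X, col=1, n_permute=4, rng=rng)
print(preds.shape)
```

The trade-off is memory: the stacked array is n_permute times the size of X, so very large inputs may need the permutations split into smaller batches.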

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@monte-flora monte-flora merged commit a353d2f into master Apr 2, 2026
11 checks passed
@monte-flora monte-flora deleted the improve/performance-optimization branch April 2, 2026 18:10