
perf(pipeline): optional multiprocessing + configurable ONNX providers for data prep + eval#71

Open
Harikrishnareddyl wants to merge 9 commits into livekit:main from Harikrishnareddyl:perf/parallel-augmentation

Conversation


Harikrishnareddyl commented Apr 23, 2026

Summary

Three sibling perf fixes for the same class of GIL / CPU-pinning bug in adjacent code paths. All opt-in — defaults preserve the existing single-threaded, CPU-only behavior byte-for-byte.

  1. _augment_directory in src/livekit/wakeword/data/augment.py — a single-threaded Python for-loop over audio DSP. The GIL keeps it pinned to one core on any host.
  2. extract_features_from_directory in src/livekit/wakeword/data/features.py — same structural bug, plus MelSpectrogramFrontend / SpeechEmbedding hard-pin ONNX Runtime to CPUExecutionProvider at construction, so GPUs sit idle.
  3. run_eval in src/livekit/wakeword/eval/evaluate.py — same hardcoded ["CPUExecutionProvider"] and a fixed batch_size=1 that leaves huge perf on the table on GPU.

The fix is the same shape each time: opt-in n_workers / mp_context / execution_providers on a new pydantic config section, defaulting to legacy behavior.

Benchmark: augmentation (32-CPU Modal container, 60k-clip dataset)

Measured on a real Modal training run — 60k clips end-to-end in ~6 minutes:

Split                    Throughput      Wall-clock
positive_train (25k)     178 clips/sec   2:20
positive_test (5k)       174 clips/sec   0:28
negative_train (25k)     130 clips/sec   3:12
negative_test (~5k)      91 clips/sec    0:53
background_train (2k)    83 clips/sec    0:24
background_test (500)    62 clips/sec    0:08

For reference, single-threaded throughput on the same host is ~2.3 clips/sec; the full 60k set would otherwise take ~7 hours.

Benchmark: feature extraction (32-CPU L40S Modal container)

Single-threaded: ~3.5 clips/sec → ~4 h for a 55k-clip dataset before training even starts.

Expected with n_workers: 0 and per-worker intra_op=inter_op=1: ~200–400 effective clips/sec → ~2–4 min.
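For concreteness, a minimal sketch of the per-worker pinning referred to above, using onnxruntime's SessionOptions (make_session is an illustrative helper, not the PR's actual code):

import onnxruntime as ort

def make_session(model_path: str, providers: list[str]) -> ort.InferenceSession:
    # Pin this worker's session to one intra-/inter-op thread so n_workers
    # processes do not fan out into n_workers x N ORT threads.
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)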

Config surface

augmentation:
  n_workers: 0   # 0 = os.cpu_count(); 1 = legacy single-threaded (default)
  mp_context: auto   # fork on Linux/macOS, spawn on Windows

feature_extraction:
  n_workers: 0
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]

eval:
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
  batch_size: 64

All fields are optional and sections can be omitted entirely.
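For reference, a minimal pydantic sketch of the sections above (field names mirror the YAML and the stated defaults; the actual models in the PR may be laid out differently):

from pydantic import BaseModel, Field

class AugmentationConfig(BaseModel):
    n_workers: int = 1        # 1 = legacy single-threaded; 0 = os.cpu_count()
    mp_context: str = "auto"  # fork on Linux/macOS, spawn on Windows

class FeatureExtractionConfig(BaseModel):
    n_workers: int = 1
    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"]
    )

class EvalConfig(BaseModel):
    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"]
    )
    batch_size: int = 1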

Backwards compatibility

  • AugmentationConfig.n_workers and FeatureExtractionConfig.n_workers both default to 1 — legacy single-threaded code paths, unchanged.
  • execution_providers defaults to ["CPUExecutionProvider"] everywhere — matches current hardcoded behavior.
  • mp_context: "auto" picks fork on Linux/macOS, spawn on Windows. Users override if a fork-unsafe backend misbehaves (resolution sketched after this list).
  • No public API changes. No changes to output file naming, output WAV format, or .npy feature layout.
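A sketch of how the "auto" resolution could be implemented (resolve_mp_context is a hypothetical helper; the PR's actual resolution logic may differ):

import multiprocessing as mp
import sys

def resolve_mp_context(name: str):
    # "auto" resolves to fork on Linux/macOS, spawn on Windows.
    if name == "auto":
        name = "spawn" if sys.platform == "win32" else "fork"
    return mp.get_context(name)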

Implementation notes

  • Feature workers use pool.imap (ordered), not imap_unordered. Downstream classifier training expects deterministic per-clip order within each split.
  • ORT sessions are pinned to 1 intra-/inter-op thread in each worker. Without this, 32 workers × N ORT threads thread-explode on a 32-CPU host and either crash or thrash. Learned the hard way on a live run.
  • ORT sessions are not pickle-safe. Workers rebuild them from the model paths via the Pool initializer; the parent's MelSpectrogramFrontend / SpeechEmbedding instances are never sent over the queue.
  • Per-worker random state. Each augmentation worker seeds random / np.random with (round_idx ^ pid) so RIR choice / SNR draws / audiomentations probabilities diverge across workers instead of all workers making identical choices. (The four notes above are pulled together in the sketch after this list.)
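Pulled together, the four notes above amount to a worker pool of roughly this shape (a condensed sketch; _init_worker, _extract_one, and extract_all are illustrative names, and the real extraction body is elided):

import multiprocessing as mp
import os
import random

import numpy as np
import onnxruntime as ort

# Per-process globals: rebuilt inside each worker because ORT sessions
# are not pickle-safe and must never cross the Pool's queue.
_mel = None
_embed = None

def _init_worker(mel_path, embed_path, providers, round_idx):
    global _mel, _embed
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1   # avoid n_workers x N ORT threads
    opts.inter_op_num_threads = 1
    _mel = ort.InferenceSession(mel_path, sess_options=opts, providers=providers)
    _embed = ort.InferenceSession(embed_path, sess_options=opts, providers=providers)
    # Per-worker random state: diverge RNG streams across workers.
    seed = round_idx ^ os.getpid()
    random.seed(seed)
    np.random.seed(seed % (2**32))

def _extract_one(clip_path):
    # Run _mel then _embed on the clip; the real body is elided here.
    ...

def extract_all(paths, n_workers, mp_ctx, mel_path, embed_path, providers, round_idx=0):
    ctx = mp.get_context(mp_ctx)
    with ctx.Pool(
        processes=n_workers or os.cpu_count(),
        initializer=_init_worker,
        initargs=(mel_path, embed_path, providers, round_idx),
    ) as pool:
        # imap (ordered), not imap_unordered: per-clip order within each
        # split stays deterministic for downstream classifier training.
        return list(pool.imap(_extract_one, paths))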

Why not default to parallel / GPU?

Safer to keep opt-in until CI covers spawn on Windows and onnxruntime-gpu installs. Once that lands, defaults can flip.

Test plan

  • tests/test_augment_parallel.py — round-trip parity (file count, shape, duration) between single-threaded and pool augmentation paths. 4 tests.
  • tests/test_features_parallel.py — round-trip parity (shape, dtype, per-clip ordering, numerical equivalence) between single-threaded and pool feature extraction. 3 tests; the parity shape is sketched after this list.
  • Full test suite passes locally: 67 passed via uv run --extra train pytest.
  • Manually verified on 32-CPU / L40S Modal container with real 25k+5k+25k+5k+2k+500 hey_livekit dataset — augmented in ~6 min, feeds downstream feature extractor without modification.
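The parity assertions in those tests have roughly this shape (a sketch; run_extraction stands in for whatever harness drives the two code paths, and is not a real helper in the repo):

import numpy as np

def test_pool_matches_single_threaded(tmp_path):
    # run_extraction is an assumed harness that extracts features from the
    # same clips with a given worker count.
    single = run_extraction(tmp_path, n_workers=1)
    pooled = run_extraction(tmp_path, n_workers=0)
    assert len(single) == len(pooled)
    for a, b in zip(single, pooled):
        assert a.shape == b.shape and a.dtype == b.dtype
        np.testing.assert_allclose(a, b, rtol=1e-5, atol=1e-6)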

Commit layout

  1. config: add AugmentationConfig.n_workers and mp_context fields
  2. augment: optional multiprocessing.Pool for _augment_directory
  3. tests: round-trip parity between single-threaded and pool augment
  4. docs: document AugmentationConfig.n_workers with benchmarks
  5. config: add FeatureExtractionConfig and EvalConfig with ONNX providers
  6. features: optional multiprocessing.Pool + configurable ONNX providers
  7. eval: configurable ONNX providers and batch_size
  8. tests: round-trip parity for parallel feature extraction
  9. docs: document n_workers + execution_providers across pipeline

Implementation reference

A working proof-of-concept monkey-patch that demonstrated the augment 90× speedup lives at https://github.com/Harikrishnareddyl/hands-free/blob/main/training/patches/parallel_augment.py — this PR is a cleaned-up, config-driven, in-tree version of that patch plus the two sibling fixes.

Adds an opt-in parallel code path gated on AugmentationConfig.n_workers
(default 1 preserves single-threaded behavior). On a 32-CPU Linux
container, this takes augmentation DSP throughput from ~2 clips/sec
to ~210 clips/sec — roughly a 90× speedup for a 25k-clip dataset.

CLAassistant commented Apr 23, 2026

CLA assistant check
All committers have signed the CLA.

Parallelises extract_features_from_directory behind
FeatureExtractionConfig.n_workers (default 1 preserves the
single-threaded path). Each worker constructs its own mel +
embedding ONNX sessions via the Pool initializer — ORT sessions
are not pickle-safe, and upstream workflows that loaded sessions
in the parent would not survive forking anyway.

Workers pin ORT to 1 intra-/inter-op thread. Without this, N workers
each spawning M ORT threads on a 32-CPU host thread-explodes and
either crashes or thrashes.

Uses pool.imap (ordered), not imap_unordered, so per-clip order
within each split stays deterministic — downstream classifier training
relies on consistent sample ordering.

Execution providers are plumbed through MelSpectrogramFrontend and
SpeechEmbedding constructors. Default ["CPUExecutionProvider"]
preserves current behavior; users can opt into CUDA on GPU boxes.

Replaces the hardcoded providers=["CPUExecutionProvider"] at
evaluate.py:197 with config.eval.execution_providers, and plumbs
config.eval.batch_size through to _predict_onnx (see the sketch below).

Default behavior unchanged (CPU-only, batch_size=1); users on GPU
hosts opt in via the new EvalConfig fields.
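A sketch of the eval change's shape (the _predict_onnx batching body is illustrative, not the file's actual code; config.eval fields are per the config surface above):

import numpy as np
import onnxruntime as ort

def _predict_onnx(session: ort.InferenceSession, clips: np.ndarray, batch_size: int) -> np.ndarray:
    # Batch the forward passes instead of the old fixed batch_size=1 loop.
    input_name = session.get_inputs()[0].name
    outs = []
    for start in range(0, len(clips), batch_size):
        batch = clips[start : start + batch_size]
        outs.append(session.run(None, {input_name: batch})[0])
    return np.concatenate(outs)

# At the call site, the hardcoded provider list becomes config-driven:
#   session = ort.InferenceSession(model_path, providers=config.eval.execution_providers)
#   scores = _predict_onnx(session, clips, batch_size=config.eval.batch_size)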
Harikrishnareddyl changed the title from "perf(augment): optional multiprocessing pool for 90× DSP speedup" to "perf(pipeline): optional multiprocessing + configurable ONNX providers for data prep + eval" on Apr 23, 2026
Harikrishnareddyl marked this pull request as ready for review April 23, 2026 17:10