
perf(pipeline): optional multiprocessing + configurable ONNX providers for data prep + eval#71

Open
Harikrishnareddyl wants to merge 9 commits into livekit:main from Harikrishnareddyl:perf/parallel-augmentation

Conversation


Harikrishnareddyl commented Apr 23, 2026

Summary

Three sibling perf fixes for the same class of GIL / CPU-pinning bug in adjacent code paths. All opt-in — defaults preserve the existing single-threaded, CPU-only behavior byte-for-byte.

  1. _augment_directory in src/livekit/wakeword/data/augment.py — a single-threaded Python for-loop over audio DSP. The GIL keeps it pinned to one core on any host.
  2. extract_features_from_directory in src/livekit/wakeword/data/features.py — same structural bug, plus MelSpectrogramFrontend / SpeechEmbedding hard-pin ONNX Runtime to CPUExecutionProvider at construction, so GPUs sit idle.
  3. run_eval in src/livekit/wakeword/eval/evaluate.py — same hardcoded ["CPUExecutionProvider"] and a fixed batch_size=1 that leaves huge perf on the table on GPU.

The fix is the same shape each time: opt-in n_workers / mp_context / execution_providers on a new pydantic config section, defaulting to legacy behavior.

Benchmark: augmentation (32-CPU Modal container, 60k-clip dataset)

Measured on a real Modal training run — 60k clips end-to-end in ~6 minutes:

Split                    Throughput      Wall-clock
positive_train (25k)     178 clips/sec   2:20
positive_test (5k)       174 clips/sec   0:28
negative_train (25k)     130 clips/sec   3:12
negative_test (~5k)      91 clips/sec    0:53
background_train (2k)    83 clips/sec    0:24
background_test (500)    62 clips/sec    0:08

For reference, single-threaded throughput on the same host is ~2.3 clips/sec; the full 60k set would otherwise take ~7 hours.

Benchmark: feature extraction (32-CPU L40S Modal container)

Single-threaded: ~3.5 clips/sec → ~4 h for a 55k-clip dataset before training even starts.

Expected with n_workers: 0 and per-worker intra_op=inter_op=1: ~200–400 effective clips/sec → ~2–4 min.
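For concreteness, a minimal sketch of the per-worker pinning referred to above, using onnxruntime's SessionOptions (make_session is an illustrative helper, not the PR's actual code):

import onnxruntime as ort

def make_session(model_path: str, providers: list[str]) -> ort.InferenceSession:
    # Pin this worker's session to one intra-/inter-op thread so n_workers
    # processes do not fan out into n_workers x N ORT threads.
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)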

Config surface

augmentation:
  n_workers: 0   # 0 = os.cpu_count(); 1 = legacy single-threaded (default)
  mp_context: auto   # fork on Linux/macOS, spawn on Windows

feature_extraction:
  n_workers: 0
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]

eval:
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
  batch_size: 64

All fields are optional and sections can be omitted entirely.
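For reference, a minimal pydantic sketch of the sections above (field names mirror the YAML and the stated defaults; the actual models in the PR may be laid out differently):

from pydantic import BaseModel, Field

class AugmentationConfig(BaseModel):
    n_workers: int = 1        # 1 = legacy single-threaded; 0 = os.cpu_count()
    mp_context: str = "auto"  # fork on Linux/macOS, spawn on Windows

class FeatureExtractionConfig(BaseModel):
    n_workers: int = 1
    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"]
    )

class EvalConfig(BaseModel):
    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"]
    )
    batch_size: int = 1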

Backwards compatibility

  • AugmentationConfig.n_workers and FeatureExtractionConfig.n_workers both default to 1 — legacy single-threaded code paths, unchanged.
  • execution_providers defaults to ["CPUExecutionProvider"] everywhere — matches current hardcoded behavior.
  • mp_context: "auto" picks fork on Linux/macOS, spawn on Windows. Users override if a fork-unsafe backend misbehaves (resolution sketched after this list).
  • No public API changes. No changes to output file naming, output WAV format, or .npy feature layout.
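A sketch of how the "auto" resolution could be implemented (resolve_mp_context is a hypothetical helper; the PR's actual resolution logic may differ):

import multiprocessing as mp
import sys

def resolve_mp_context(name: str):
    # "auto" resolves to fork on Linux/macOS, spawn on Windows.
    if name == "auto":
        name = "spawn" if sys.platform == "win32" else "fork"
    return mp.get_context(name)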

Implementation notes

  • Feature workers use pool.imap (ordered), not imap_unordered. Downstream classifier training expects deterministic per-clip order within each split.
  • ORT sessions are pinned to 1 intra-/inter-op thread in each worker. Without this, 32 workers × N ORT threads thread-explode on a 32-CPU host and either crash or thrash. Learned the hard way on a live run.
  • ORT sessions are not pickle-safe. Workers rebuild them from the model paths via the Pool initializer; the parent's MelSpectrogramFrontend / SpeechEmbedding instances are never sent over the queue.
  • Per-worker random state. Each augmentation worker seeds random / np.random with (round_idx ^ pid) so RIR choice / SNR draws / audiomentations probabilities diverge across workers instead of all workers making identical choices. (The four notes above are pulled together in the sketch after this list.)
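Pulled together, the four notes above amount to a worker pool of roughly this shape (a condensed sketch; _init_worker, _extract_one, and extract_all are illustrative names, and the real extraction body is elided):

import multiprocessing as mp
import os
import random

import numpy as np
import onnxruntime as ort

# Per-process globals: rebuilt inside each worker because ORT sessions
# are not pickle-safe and must never cross the Pool's queue.
_mel = None
_embed = None

def _init_worker(mel_path, embed_path, providers, round_idx):
    global _mel, _embed
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1   # avoid n_workers x N ORT threads
    opts.inter_op_num_threads = 1
    _mel = ort.InferenceSession(mel_path, sess_options=opts, providers=providers)
    _embed = ort.InferenceSession(embed_path, sess_options=opts, providers=providers)
    # Per-worker random state: diverge RNG streams across workers.
    seed = round_idx ^ os.getpid()
    random.seed(seed)
    np.random.seed(seed % (2**32))

def _extract_one(clip_path):
    # Run _mel then _embed on the clip; the real body is elided here.
    ...

def extract_all(paths, n_workers, mp_ctx, mel_path, embed_path, providers, round_idx=0):
    ctx = mp.get_context(mp_ctx)
    with ctx.Pool(
        processes=n_workers or os.cpu_count(),
        initializer=_init_worker,
        initargs=(mel_path, embed_path, providers, round_idx),
    ) as pool:
        # imap (ordered), not imap_unordered: per-clip order within each
        # split stays deterministic for downstream classifier training.
        return list(pool.imap(_extract_one, paths))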

Why not default to parallel / GPU?

Safer to keep opt-in until CI covers spawn on Windows and onnxruntime-gpu installs. Once that lands, defaults can flip.

Test plan

  • tests/test_augment_parallel.py — round-trip parity (file count, shape, duration) between single-threaded and pool augmentation paths. 4 tests.
  • tests/test_features_parallel.py — round-trip parity (shape, dtype, per-clip ordering, numerical equivalence) between single-threaded and pool feature extraction. 3 tests; the parity shape is sketched after this list.
  • Full test suite passes locally: 67 passed via uv run --extra train pytest.
  • Manually verified on 32-CPU / L40S Modal container with real 25k+5k+25k+5k+2k+500 hey_livekit dataset — augmented in ~6 min, feeds downstream feature extractor without modification.
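The parity assertions in those tests have roughly this shape (a sketch; run_extraction stands in for whatever harness drives the two code paths, and is not a real helper in the repo):

import numpy as np

def test_pool_matches_single_threaded(tmp_path):
    # run_extraction is an assumed harness that extracts features from the
    # same clips with a given worker count.
    single = run_extraction(tmp_path, n_workers=1)
    pooled = run_extraction(tmp_path, n_workers=0)
    assert len(single) == len(pooled)
    for a, b in zip(single, pooled):
        assert a.shape == b.shape and a.dtype == b.dtype
        np.testing.assert_allclose(a, b, rtol=1e-5, atol=1e-6)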

Commit layout

  1. config: add AugmentationConfig.n_workers and mp_context fields
  2. augment: optional multiprocessing.Pool for _augment_directory
  3. tests: round-trip parity between single-threaded and pool augment
  4. docs: document AugmentationConfig.n_workers with benchmarks
  5. config: add FeatureExtractionConfig and EvalConfig with ONNX providers
  6. features: optional multiprocessing.Pool + configurable ONNX providers
  7. eval: configurable ONNX providers and batch_size
  8. tests: round-trip parity for parallel feature extraction
  9. docs: document n_workers + execution_providers across pipeline

Implementation reference

A working proof-of-concept monkey-patch that demonstrated the augment 90× speedup lives at https://github.com/Harikrishnareddyl/hands-free/blob/main/training/patches/parallel_augment.py — this PR is a cleaned-up, config-driven, in-tree version of that patch plus the two sibling fixes.

Adds an opt-in parallel code path gated on AugmentationConfig.n_workers
(default 1 preserves single-threaded behavior). On a 32-CPU Linux
container, this takes augmentation DSP throughput from ~2 clips/sec
to ~210 clips/sec — roughly a 90× speedup for a 25k-clip dataset.

CLAassistant commented Apr 23, 2026

CLA assistant check
All committers have signed the CLA.

Parallelises extract_features_from_directory behind
FeatureExtractionConfig.n_workers (default 1 preserves the
single-threaded path). Each worker constructs its own mel +
embedding ONNX sessions via the Pool initializer — ORT sessions
are not pickle-safe, and upstream workflows that loaded sessions
in the parent would not survive forking anyway.

Workers pin ORT to 1 intra-/inter-op thread. Without this, N workers
each spawning M ORT threads on a 32-CPU host thread-explodes and
either crashes or thrashes.

Uses pool.imap (ordered), not imap_unordered, so per-clip order
within each split stays deterministic — downstream classifier training
relies on consistent sample ordering.

Execution providers are plumbed through MelSpectrogramFrontend and
SpeechEmbedding constructors. Default ["CPUExecutionProvider"]
preserves current behavior; users can opt into CUDA on GPU boxes.

Replaces the hardcoded providers=["CPUExecutionProvider"] at
evaluate.py:197 with config.eval.execution_providers, and plumbs
config.eval.batch_size through to _predict_onnx (see the sketch below).

Default behavior unchanged (CPU-only, batch_size=1); users on GPU
hosts opt in via the new EvalConfig fields.
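A sketch of the eval change's shape (the _predict_onnx batching body is illustrative, not the file's actual code; config.eval fields are per the config surface above):

import numpy as np
import onnxruntime as ort

def _predict_onnx(session: ort.InferenceSession, clips: np.ndarray, batch_size: int) -> np.ndarray:
    # Batch the forward passes instead of the old fixed batch_size=1 loop.
    input_name = session.get_inputs()[0].name
    outs = []
    for start in range(0, len(clips), batch_size):
        batch = clips[start : start + batch_size]
        outs.append(session.run(None, {input_name: batch})[0])
    return np.concatenate(outs)

# At the call site, the hardcoded provider list becomes config-driven:
#   session = ort.InferenceSession(model_path, providers=config.eval.execution_providers)
#   scores = _predict_onnx(session, clips, batch_size=config.eval.batch_size)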
Harikrishnareddyl changed the title from "perf(augment): optional multiprocessing pool for 90× DSP speedup" to "perf(pipeline): optional multiprocessing + configurable ONNX providers for data prep + eval" on Apr 23, 2026
Harikrishnareddyl marked this pull request as ready for review April 23, 2026 17:10