perf(pipeline): optional multiprocessing + configurable ONNX providers for data prep + eval #71
Harikrishnareddyl wants to merge 9 commits into livekit:main
Adds an opt-in parallel code path gated on `AugmentationConfig.n_workers` (default `1` preserves single-threaded behavior). On a 32-CPU Linux container, this takes augmentation DSP throughput from ~2 clips/sec to ~210 clips/sec — roughly a 90× speedup for a 25k-clip dataset.
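A minimal sketch of the gating described above, assuming a per-clip augmentation callable; function and parameter names here are illustrative, not the PR's actual code:

```python
# Illustrative sketch of the opt-in parallel gate -- not the PR's actual code.
import os
import random
from multiprocessing import Pool

def _reseed_worker():
    # Forked workers inherit identical RNG state; without a per-pid reseed,
    # every worker would make the same random augmentation choices.
    random.seed(os.getpid())

def augment_directory(clips, augment_one, n_workers=1):
    """Run `augment_one` over every clip; parallel only when n_workers > 1."""
    if n_workers <= 1:
        # Default path: identical to the legacy single-threaded loop.
        return [augment_one(c) for c in clips]
    with Pool(processes=n_workers, initializer=_reseed_worker) as pool:
        # Ordered imap keeps per-clip output order deterministic.
        return list(pool.imap(augment_one, clips, chunksize=32))
```

The key property is that `n_workers=1` never touches `multiprocessing` at all, so the default path stays byte-for-byte identical to the old loop.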
Parallelises `extract_features_from_directory` behind `FeatureExtractionConfig.n_workers` (default `1` preserves the single-threaded path). Each worker constructs its own mel + embedding ONNX sessions via the `Pool` initializer — ORT sessions are not pickle-safe, and upstream workflows that loaded sessions in the parent would not survive forking anyway. Workers pin ORT to 1 intra-/inter-op thread; without this, N workers each spawning M ORT threads on a 32-CPU host thread-explodes and either crashes or thrashes. Uses `pool.imap` (ordered), not `imap_unordered`, so per-clip order within each split stays deterministic — downstream classifier training relies on consistent sample ordering. Execution providers are plumbed through the `MelSpectrogramFrontend` and `SpeechEmbedding` constructors. The default `["CPUExecutionProvider"]` preserves current behavior; users can opt into CUDA on GPU boxes.
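The worker lifecycle described above can be sketched roughly as follows. The session objects are stand-ins — the real code would build `onnxruntime.InferenceSession`s here with `intra_op_num_threads = inter_op_num_threads = 1` — and all names are illustrative:

```python
import multiprocessing as mp

_sessions = None  # per-process global; each worker populates its own copy

def _init_worker(mel_path, emb_path, providers):
    """Pool initializer: runs once inside each child process.

    This is where the real code would call something like
    ort.InferenceSession(mel_path, sess_options, providers=providers)
    with intra-/inter-op threads pinned to 1. Sessions are built in the
    child because ORT sessions are not picklable.
    """
    global _sessions
    _sessions = {"mel": (mel_path, providers), "emb": (emb_path, providers)}

def _extract_one(clip):
    mel_model, _providers = _sessions["mel"]
    return (clip, mel_model)  # stand-in for running mel + embedding models

def extract_features(clips, mel_path, emb_path, providers, n_workers=1):
    if n_workers <= 1:
        _init_worker(mel_path, emb_path, providers)  # legacy in-process path
        return [_extract_one(c) for c in clips]
    with mp.Pool(n_workers, initializer=_init_worker,
                 initargs=(mel_path, emb_path, providers)) as pool:
        return list(pool.imap(_extract_one, clips))  # ordered, deterministic
```

Because only file paths and provider names cross the process boundary, nothing unpicklable is ever shipped from the parent.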
Replaces the hardcoded `providers=["CPUExecutionProvider"]` at `evaluate.py:197` with `config.eval.execution_providers`, and plumbs `config.eval.batch_size` through to `_predict_onnx`. Default behavior is unchanged (CPU-only, `batch_size=1`); users on GPU hosts opt in via the new `EvalConfig` fields.
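A hedged sketch of what the batched prediction path might look like — the function name, `input_name` argument, and batching shape are assumptions, not the PR's code; `session` is anything that behaves like an `onnxruntime.InferenceSession` built with `providers=config.eval.execution_providers`:

```python
import numpy as np

def predict_onnx_batched(session, clips, input_name, batch_size=1):
    """Run inference in batches of `batch_size` instead of one clip at a time.

    With batch_size=1 this degenerates to the old per-clip loop, so the
    default behavior is unchanged; larger batches amortize per-call
    overhead, which matters most on GPU providers.
    """
    outs = []
    for i in range(0, len(clips), batch_size):
        batch = np.stack(clips[i:i + batch_size])  # (B, ...) input tensor
        outs.append(session.run(None, {input_name: batch})[0])
    return np.concatenate(outs)
```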
## Summary
Three sibling perf fixes for the same class of GIL / CPU-pinning bug in adjacent code paths. All opt-in — defaults preserve the existing single-threaded, CPU-only behavior byte-for-byte.
- `_augment_directory` in `src/livekit/wakeword/data/augment.py` — a single-threaded Python `for` loop over audio DSP. The GIL keeps it pinned to one core on any host.
- `extract_features_from_directory` in `src/livekit/wakeword/data/features.py` — same structural bug, plus `MelSpectrogramFrontend` / `SpeechEmbedding` hard-pin ONNX Runtime to `CPUExecutionProvider` at construction, so GPUs sit idle.
- `run_eval` in `src/livekit/wakeword/eval/evaluate.py` — same hardcoded `["CPUExecutionProvider"]` and a fixed `batch_size=1` that leaves huge perf on the table on GPU.

The fix is the same shape each time: opt-in `n_workers` / `mp_context` / `execution_providers` on a new pydantic config section, defaulting to legacy behavior.

## Benchmark: augmentation (32-CPU Modal container, 60k-clip dataset)
Measured on a real Modal training run — 60k clips end-to-end in ~6 minutes:
- `positive_train` (25k)
- `positive_test` (5k)
- `negative_train` (25k)
- `negative_test` (~5k)
- `background_train` (2k)
- `background_test` (500)

Reference single-threaded on the same host is ~2.3 clips/sec — the full 60k set would otherwise take ~7 hours.
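Quick sanity check of the arithmetic, using the split sizes listed above and the ~2.3 and ~210 clips/sec figures quoted in this PR:

```python
# Split sizes above sum to roughly the "60k" dataset.
total_clips = 25_000 + 5_000 + 25_000 + 5_000 + 2_000 + 500   # 62,500
hours_single = total_clips / 2.3 / 3600    # baseline ~2.3 clips/sec
minutes_pool = total_clips / 210 / 60      # measured ~210 clips/sec
print(f"single-threaded: ~{hours_single:.1f} h, pooled: ~{minutes_pool:.1f} min")
```

That lands near the ~7 h and ~6 min figures claimed above (the measured run includes I/O and setup beyond raw DSP throughput).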
## Benchmark: feature extraction (32-CPU L40S Modal container)
Single-threaded: ~3.5 clips/sec → ~4 h for a 55k-clip dataset before training even starts.
Expected with `n_workers: 0` and per-worker `intra_op` = `inter_op` = 1: ~200–400 effective clips/sec → ~2–4 min.

## Config surface
All fields are optional and sections can be omitted entirely.
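A rough sketch of the shape of the new fields. The PR adds these as pydantic config sections; plain dataclasses are used here only to keep the sketch dependency-free, and field names follow this description rather than the repo's actual models:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AugmentationConfig:
    n_workers: int = 1          # default 1 keeps the legacy single-threaded path
    mp_context: str = "auto"    # "auto" | "fork" | "spawn"

@dataclass
class FeatureExtractionConfig:
    n_workers: int = 1
    mp_context: str = "auto"
    execution_providers: List[str] = field(
        default_factory=lambda: ["CPUExecutionProvider"])

@dataclass
class EvalConfig:
    execution_providers: List[str] = field(
        default_factory=lambda: ["CPUExecutionProvider"])
    batch_size: int = 1         # matches the current hardcoded behavior
```

With every field defaulted, an omitted section is equivalent to constructing the config with no arguments, which is what makes the sections optional.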
## Backwards compatibility
- `AugmentationConfig.n_workers` and `FeatureExtractionConfig.n_workers` both default to `1` — legacy single-threaded code paths, unchanged.
- `execution_providers` defaults to `["CPUExecutionProvider"]` everywhere — matches current hardcoded behavior.
- `mp_context: "auto"` picks `fork` on Linux/macOS, `spawn` on Windows. Users override if a fork-unsafe backend misbehaves.
- The `.npy` feature layout is unchanged.

## Implementation notes
- `pool.imap` (ordered), not `imap_unordered`. Downstream classifier training expects deterministic per-clip order within each split.
- Per-worker ONNX sessions are built in the `Pool` initializer; the parent's `MelSpectrogramFrontend` / `SpeechEmbedding` instances are never sent over the queue.
- Each worker reseeds `random` / `np.random` with `(round_idx ^ pid)` so RIR choice / SNR draws / audiomentations probabilities diverge across workers instead of all workers making identical choices.

## Why not default to parallel / GPU?
Safer to keep opt-in until CI covers `spawn` on Windows and `onnxruntime-gpu` installs. Once that lands, defaults can flip.

## Test plan
- `tests/test_augment_parallel.py` — round-trip parity (file count, shape, duration) between single-threaded and pool augmentation paths. 4 tests.
- `tests/test_features_parallel.py` — round-trip parity (shape, dtype, per-clip ordering, numerical equivalence) between single-threaded and pool feature extraction. 3 tests.
- 67 passed via `uv run --extra train pytest`.
- Real `hey_livekit` dataset — augmented in ~6 min, feeds the downstream feature extractor without modification.

## Commit layout
1. config: add AugmentationConfig.n_workers and mp_context fields
2. augment: optional multiprocessing.Pool for _augment_directory
3. tests: round-trip parity between single-threaded and pool augment
4. docs: document AugmentationConfig.n_workers with benchmarks
5. config: add FeatureExtractionConfig and EvalConfig with ONNX providers
6. features: optional multiprocessing.Pool + configurable ONNX providers
7. eval: configurable ONNX providers and batch_size
8. tests: round-trip parity for parallel feature extraction
9. docs: document n_workers + execution_providers across pipeline

## Implementation reference
A working proof-of-concept monkey-patch that demonstrated the augment 90× speedup lives at https://github.com/Harikrishnareddyl/hands-free/blob/main/training/patches/parallel_augment.py — this PR is a cleaned-up, config-driven, in-tree version of that patch plus the two sibling fixes.