32 changes: 32 additions & 0 deletions configs/prod.yaml
@@ -100,6 +100,38 @@ augmentation:
  # Room impulse response directories for reverb (downloaded via `setup`)
  rir_paths: [./data/rirs]

  # Parallelism for the per-clip DSP loop. Default 1 preserves the legacy
  # single-threaded code path. On a multi-core host (e.g. 32-CPU Modal
  # container) setting this to 0 uses all cores and gives a ~10–100×
  # speedup for the augmentation stage.
  # n_workers: 0  # uncomment to use all CPU cores (10–100× faster)

# ============================================================================
# Feature Extraction (all fields optional — defaults to single-threaded CPU)
# ============================================================================
#
# Parallelism for the mel + embedding ONNX loop. Default 1 preserves the
# legacy single-threaded behavior; 0 = os.cpu_count(); N = explicit count.
# Workers pin ORT to 1 intra/inter-op thread to avoid thread explosion.
#
# ONNX Runtime providers: default is CPU-only. On a GPU host with
# onnxruntime-gpu installed, use CUDAExecutionProvider.
#
# feature_extraction:
#   n_workers: 0
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]

# ============================================================================
# Evaluation (all fields optional — defaults to single-threaded CPU)
# ============================================================================
#
# Same provider story as feature_extraction. batch_size default 1 is fine on
# CPU; bump to 64+ on GPU to saturate the device.
#
# eval:
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#   batch_size: 64

# ============================================================================
# Model Architecture
# ============================================================================
9 changes: 9 additions & 0 deletions configs/test.yaml
@@ -39,6 +39,15 @@ augmentation:
  rounds: 3
  background_paths: [./data/backgrounds]
  rir_paths: [./data/rirs]
  # n_workers: 0  # uncomment to use all CPU cores (10–100× faster)

# feature_extraction:
#   n_workers: 0
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#
# eval:
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#   batch_size: 64

# ============================================================================
# Model Architecture
29 changes: 29 additions & 0 deletions docs/augmentation.md
@@ -125,3 +125,32 @@ output/<model_name>/
Only `_rN.wav` files are fed to feature extraction — clean TTS originals are excluded from training since they don't match real microphone audio.

Feature extraction is a separate step — see [Feature Extraction](feature-extraction.md).

## Parallel Execution (`n_workers`)

The per-clip loop in `_augment_directory` is a pure Python `for` over `soundfile.read`, `scipy.signal.fftconvolve`, and audiomentations transforms. Because of the GIL, adding CPU cores to the process does nothing on its own — each clip is processed sequentially on a single core. On a 32-CPU host, augmenting a 25k-clip dataset this way takes ~3 hours even though the work is embarrassingly parallel.

`AugmentationConfig.n_workers` opts into a `multiprocessing.Pool` that runs the loop across worker processes. Each worker constructs its own `AudioAugmentor` via the pool's `initializer` callback — the parent's lazy-loaded audiomentations instance is never pickled, which keeps the setup robust even as upstream transforms evolve.
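A minimal sketch of that pattern, assuming hypothetical helper names (`_init_worker`, `_augment_one`, `augment_parallel`) and an `AudioAugmentor(config)` / `.augment(path)` signature rather than the module's actual internals:

```python
import multiprocessing as mp

_augmentor = None  # one worker-local AudioAugmentor per process


def _init_worker(config):
    # Runs once inside each worker: build the augmentor locally rather
    # than pickling the parent's lazy-loaded audiomentations instance.
    global _augmentor
    _augmentor = AudioAugmentor(config)


def _augment_one(clip_path):
    # Uses the worker-local instance created by _init_worker; the
    # augmented wavs are written to disk inside the worker.
    return _augmentor.augment(clip_path)


def augment_parallel(clip_paths, config, n_workers):
    with mp.Pool(n_workers, initializer=_init_worker, initargs=(config,)) as pool:
        for _ in pool.imap_unordered(_augment_one, clip_paths):
            pass  # parent just drains the iterator
```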

On a 32-CPU Modal container, augmenting a ~60k-clip dataset (25k positive_train + 5k positive_test + 25k negative_train + ~5k negative_test + ~2.5k backgrounds) completed end-to-end in **~6 minutes**:

| Split | Throughput | Wall-clock |
|---|---|---|
| `positive_train` (25k) | 178 clips/sec | 2:20 |
| `positive_test` (5k) | 174 clips/sec | 0:28 |
| `negative_train` (25k) | 130 clips/sec | 3:12 |
| `negative_test` (~5k) | 91 clips/sec | 0:53 |
| `background_train` (2k) | 83 clips/sec | 0:24 |
| `background_test` (500) | 62 clips/sec | 0:08 |

For reference, the single-threaded path on the same host processes ~2.3 clips/sec, so the full 60k dataset would otherwise take ~7 hours.

Semantics:

- `n_workers: 1` (default) — the legacy single-threaded code path, unchanged.
- `n_workers: 0` — auto, uses `os.cpu_count()`.
- `n_workers: N` (any positive integer) — explicit worker count.

`mp_context` controls the start method: `"auto"` picks `fork` on Linux/macOS and `spawn` on Windows. Override only if a fork-unsafe audio backend is crashing workers.
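A sketch of how those two knobs plausibly resolve to `multiprocessing` primitives (the helper name `resolve_pool_args` is illustrative, not the project's API):

```python
import multiprocessing as mp
import os
import sys


def resolve_pool_args(n_workers: int, mp_context: str):
    # 0 = all cores; 1 means the caller takes the single-threaded path
    # and skips the pool entirely; N = explicit worker count.
    workers = (os.cpu_count() or 1) if n_workers == 0 else n_workers
    # "auto": fork on Linux/macOS, spawn on Windows (which has no fork).
    if mp_context == "auto":
        mp_context = "spawn" if sys.platform == "win32" else "fork"
    return workers, mp.get_context(mp_context)
```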

Output file names, round-0 alignment, padding, and RIR / background mixing behave identically to the single-threaded path. Per-worker random state means the *exact* audio content differs between the two paths (different SNR draws, different RIR picks), but the output shape, count, and naming are identical, which is what the downstream feature extractor depends on.
12 changes: 12 additions & 0 deletions docs/evaluation.md
@@ -154,3 +154,15 @@ uv run livekit-wakeword eval configs/hey_livekit.yaml -m models/hey_livekit_oww.
```

This works because both livekit-wakeword and openWakeWord share the same frozen embedding front-end, producing identical `(16, 96)` feature matrices.

## ONNX Execution Providers

`EvalConfig.execution_providers` controls the ONNX Runtime providers used for the classifier inference session. Default `["CPUExecutionProvider"]` preserves existing behavior. On a GPU host with `onnxruntime-gpu` installed:

```yaml
eval:
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
  batch_size: 64  # default 1 is fine on CPU; bump on GPU
```

On CPU the classifier is small enough that `batch_size: 1` is rarely the bottleneck; on GPU the per-launch overhead dominates, so batching is required to see any speedup.
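To make the batching point concrete, here is a hedged sketch of a batched scoring loop; the model path, input-name handling, and feature shapes are assumptions, not the project's actual eval code:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "classifier.onnx",  # hypothetical path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name


def score(features: np.ndarray, batch_size: int = 64) -> np.ndarray:
    # One session.run per batch amortizes the per-launch overhead that
    # dominates on GPU at batch_size=1.
    out = []
    for i in range(0, len(features), batch_size):
        batch = features[i : i + batch_size].astype(np.float32)
        out.append(session.run(None, {input_name: batch})[0])
    return np.concatenate(out)
```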
18 changes: 18 additions & 0 deletions docs/feature-extraction.md
@@ -206,6 +206,24 @@ Only augmented clips (`clip_NNNNNN_rN.wav`) are processed — clean TTS original

Audio files are read via `soundfile`, converted to float32, reduced to mono if stereo, and processed one clip at a time.
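That load path amounts to a few lines of `soundfile` plus NumPy; a sketch under those assumptions, not the project's exact code:

```python
import numpy as np
import soundfile as sf


def load_clip(path: str) -> np.ndarray:
    audio, _sr = sf.read(path, dtype="float32")  # decode straight to float32
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # stereo -> mono
    return audio
```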

### Parallel Execution (`feature_extraction.n_workers`)

The per-clip mel + embedding loop in `extract_features_from_directory` is a pure Python `for` over two ONNX sessions. Under the GIL it pins to a single core; on a 32-CPU L40S container we measured ~3.5 clips/sec, which is ~4 h wall-clock for a 55k-clip dataset before training even starts.

`FeatureExtractionConfig.n_workers` opts into a `multiprocessing.Pool` that spreads the loop across worker processes. Each worker builds its own mel + embedding ONNX sessions via the pool's `initializer` (ORT sessions are not pickle-safe) and pins each session to a single intra-/inter-op thread; otherwise each worker would spin up its own full ORT thread pool, and `n_workers` × `os.cpu_count()` threads would thrash a multi-core host.

The pool uses `pool.imap` (ordered, not `imap_unordered`) so the per-clip order within a split is preserved — the downstream classifier training relies on consistent sample ordering.

```yaml
feature_extraction:
  n_workers: 0  # 0 = os.cpu_count(); 1 = single-threaded (default)
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
```
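Putting the three constraints together (per-worker sessions, single-threaded ORT, ordered `imap`), a minimal sketch of the worker setup; the helper names and tensor shapes are assumptions, not the module's real internals:

```python
import multiprocessing as mp

import onnxruntime as ort
import soundfile as sf

_mel = None
_embed = None


def _init_worker(mel_path, embed_path, providers):
    # Sessions are built per worker (they are not pickle-safe) and pinned
    # to one thread each, so n_workers processes don't multiply into
    # n_workers x cpu_count ORT threads.
    global _mel, _embed
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    _mel = ort.InferenceSession(mel_path, sess_options=opts, providers=providers)
    _embed = ort.InferenceSession(embed_path, sess_options=opts, providers=providers)


def _extract_one(path):
    audio, _sr = sf.read(path, dtype="float32")
    mel_in = {_mel.get_inputs()[0].name: audio[None, :]}  # input shape is an assumption
    mel = _mel.run(None, mel_in)[0]
    return _embed.run(None, {_embed.get_inputs()[0].name: mel})[0]


def extract_parallel(paths, mel_path, embed_path, providers, n_workers):
    with mp.Pool(n_workers, initializer=_init_worker,
                 initargs=(mel_path, embed_path, providers)) as pool:
        # imap (not imap_unordered) preserves per-clip order in the output.
        return list(pool.imap(_extract_one, paths))
```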

### ONNX Execution Providers

`FeatureExtractionConfig.execution_providers` is plumbed through `MelSpectrogramFrontend.__init__` and `SpeechEmbedding.__init__`. The default `["CPUExecutionProvider"]` keeps existing behavior; on a GPU host with `onnxruntime-gpu` installed, setting `["CUDAExecutionProvider", "CPUExecutionProvider"]` offloads mel + embedding inference to the GPU with CPU as a fallback.
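Since ORT provider lists are priority-ordered, listing `CPUExecutionProvider` last gives a fallback when CUDA is unavailable. A defensive variant (not from the project source) can also filter the requested list against what is actually installed:

```python
import onnxruntime as ort


def pick_providers(requested: list[str]) -> list[str]:
    # Filtering up front avoids ORT's warning when CUDAExecutionProvider
    # is requested but onnxruntime-gpu isn't installed.
    available = set(ort.get_available_providers())
    return [p for p in requested if p in available] or ["CPUExecutionProvider"]
```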

## Memory-Mapped Dataset

**Source:** `src/livekit/wakeword/data/dataset.py`
54 changes: 53 additions & 1 deletion src/livekit/wakeword/config.py
@@ -5,7 +5,7 @@
import logging
from enum import StrEnum
from pathlib import Path
-from typing import Annotated, Self
+from typing import Annotated, Literal, Self

import yaml
from pydantic import BaseModel, Field, model_validator
@@ -58,6 +58,52 @@ class AugmentationConfig(BaseModel):
    background_paths: list[str] = Field(default_factory=lambda: ["./data/backgrounds"])
    rir_paths: list[str] = Field(default_factory=lambda: ["./data/rirs"])

    n_workers: int = 1
    """Number of parallel worker processes for the audio DSP loop.
    0 = auto (os.cpu_count()), 1 = single-threaded (legacy, default for
    backwards compatibility), N = explicit worker count."""

    mp_context: Literal["auto", "fork", "spawn", "forkserver"] = "auto"
    """Multiprocessing start method. "auto" picks 'fork' on Linux/macOS
    and 'spawn' on Windows. Override only if a fork-unsafe audio backend
    is crashing workers."""


class FeatureExtractionConfig(BaseModel):
    """Configuration for the feature-extraction stage (mel + embedding ONNX models)."""

    n_workers: int = 1
    """Number of parallel worker processes for the feature-extraction loop.
    0 = auto (os.cpu_count()), 1 = single-threaded (legacy, default for
    backwards compatibility), N = explicit worker count."""

    mp_context: Literal["auto", "fork", "spawn", "forkserver"] = "auto"
    """Multiprocessing start method. "auto" picks 'fork' on Linux/macOS
    and 'spawn' on Windows."""

    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"],
    )
    """ONNX Runtime execution providers, in priority order. Default preserves
    CPU-only behavior. Set to ["CUDAExecutionProvider", "CPUExecutionProvider"]
    on a GPU host to offload mel + embedding inference to the GPU (requires
    onnxruntime-gpu)."""


class EvalConfig(BaseModel):
    """Configuration for the eval stage (classifier ONNX inference)."""

    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"],
    )
    """ONNX Runtime execution providers, in priority order. Default preserves
    CPU-only behavior. Set to ["CUDAExecutionProvider", "CPUExecutionProvider"]
    on a GPU host (requires onnxruntime-gpu)."""

    batch_size: int = 1
    """Batch size for classifier inference. 1 is fine on CPU; bump to 64+
    when running on GPU to saturate the device."""


class ModelConfig(BaseModel):
    model_type: ModelType = ModelType.conv_attention
@@ -141,6 +187,12 @@ class WakeWordConfig(BaseModel):
    # Augmentation
    augmentation: AugmentationConfig = Field(default_factory=AugmentationConfig)

    # Feature extraction
    feature_extraction: FeatureExtractionConfig = Field(default_factory=FeatureExtractionConfig)

    # Evaluation
    eval: EvalConfig = Field(default_factory=EvalConfig)

    # Model
    model: ModelConfig = Field(default_factory=ModelConfig)