32 changes: 32 additions & 0 deletions configs/prod.yaml
@@ -100,6 +100,38 @@ augmentation:
  # Room impulse response directories for reverb (downloaded via `setup`)
  rir_paths: [./data/rirs]

  # Parallelism for the per-clip DSP loop. Default 1 preserves the legacy
  # single-threaded code path. On a multi-core host (e.g. 32-CPU Modal
  # container) setting this to 0 uses all cores and gives a ~10–100×
  # speedup for the augmentation stage.
  # n_workers: 0  # uncomment to use all CPU cores (10–100× faster)

# ============================================================================
# Feature Extraction (all fields optional — defaults to single-threaded CPU)
# ============================================================================
#
# Parallelism for the mel + embedding ONNX loop. Default 1 preserves the
# legacy single-threaded behavior; 0 = os.cpu_count(); N = explicit count.
# Workers pin ORT to 1 intra/inter-op thread to avoid thread explosion.
#
# ONNX Runtime providers: default is CPU-only. On a GPU host with
# onnxruntime-gpu installed, use CUDAExecutionProvider.
#
# feature_extraction:
#   n_workers: 0
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]

# ============================================================================
# Evaluation (all fields optional — defaults to single-threaded CPU)
# ============================================================================
#
# Same provider story as feature_extraction. batch_size default 1 is fine on
# CPU; bump to 64+ on GPU to saturate the device.
#
# eval:
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#   batch_size: 64

# ============================================================================
# Model Architecture
# ============================================================================
9 changes: 9 additions & 0 deletions configs/test.yaml
@@ -39,6 +39,15 @@ augmentation:
  rounds: 3
  background_paths: [./data/backgrounds]
  rir_paths: [./data/rirs]
  # n_workers: 0  # uncomment to use all CPU cores (10–100× faster)

# feature_extraction:
#   n_workers: 0
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#
# eval:
#   execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
#   batch_size: 64

# ============================================================================
# Model Architecture
29 changes: 29 additions & 0 deletions docs/augmentation.md
@@ -125,3 +125,32 @@ output/<model_name>/
Only `_rN.wav` files are fed to feature extraction — clean TTS originals are excluded from training since they don't match real microphone audio.

Feature extraction is a separate step — see [Feature Extraction](feature-extraction.md).

## Parallel Execution (`n_workers`)

The per-clip loop in `_augment_directory` is a pure Python `for` over `soundfile.read`, `scipy.signal.fftconvolve`, and audiomentations transforms. Because of the GIL, adding CPU cores to the process does nothing on its own — each clip is processed sequentially on a single core. On a 32-CPU host, augmenting a 25k-clip dataset this way takes ~3 hours even though the work is embarrassingly parallel.

`AugmentationConfig.n_workers` opts into a `multiprocessing.Pool` that runs the loop across worker processes. Each worker constructs its own `AudioAugmentor` via the pool's `initializer` callback — the parent's lazy-loaded audiomentations instance is never pickled, which keeps the setup robust even as upstream transforms evolve.
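A minimal sketch of that pattern, assuming hypothetical helper names (`_init_worker`, `_augment_one`, `augment_parallel`) and an `AudioAugmentor(config)` / `.augment(path)` signature rather than the module's actual internals:

```python
import multiprocessing as mp

_augmentor = None  # one worker-local AudioAugmentor per process


def _init_worker(config):
    # Runs once inside each worker: build the augmentor locally rather
    # than pickling the parent's lazy-loaded audiomentations instance.
    global _augmentor
    _augmentor = AudioAugmentor(config)


def _augment_one(clip_path):
    # Uses the worker-local instance created by _init_worker; the
    # augmented wavs are written to disk inside the worker.
    return _augmentor.augment(clip_path)


def augment_parallel(clip_paths, config, n_workers):
    with mp.Pool(n_workers, initializer=_init_worker, initargs=(config,)) as pool:
        for _ in pool.imap_unordered(_augment_one, clip_paths):
            pass  # parent just drains the iterator
```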

On a 32-CPU Modal container, augmenting a ~60k-clip dataset (25k positive_train + 5k positive_test + 25k negative_train + ~5k negative_test + ~2.5k backgrounds) completed end-to-end in **~6 minutes**:

| Split | Throughput | Wall-clock |
|---|---|---|
| `positive_train` (25k) | 178 clips/sec | 2:20 |
| `positive_test` (5k) | 174 clips/sec | 0:28 |
| `negative_train` (25k) | 130 clips/sec | 3:12 |
| `negative_test` (~5k) | 91 clips/sec | 0:53 |
| `background_train` (2k) | 83 clips/sec | 0:24 |
| `background_test` (500) | 62 clips/sec | 0:08 |

For reference, the single-threaded path on the same host processes ~2.3 clips/sec, so the full 60k dataset would otherwise take ~7 hours.

Semantics:

- `n_workers: 1` (default) — the legacy single-threaded code path, unchanged.
- `n_workers: 0` — auto, uses `os.cpu_count()`.
- `n_workers: N` (any positive integer) — explicit worker count.

`mp_context` controls the start method: `"auto"` picks `fork` on Linux/macOS and `spawn` on Windows. Override only if a fork-unsafe audio backend is crashing workers.
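A sketch of how those two knobs plausibly resolve to `multiprocessing` primitives (the helper name `resolve_pool_args` is illustrative, not the project's API):

```python
import multiprocessing as mp
import os
import sys


def resolve_pool_args(n_workers: int, mp_context: str):
    # 0 = all cores; 1 means the caller takes the single-threaded path
    # and skips the pool entirely; N = explicit worker count.
    workers = (os.cpu_count() or 1) if n_workers == 0 else n_workers
    # "auto": fork on Linux/macOS, spawn on Windows (which has no fork).
    if mp_context == "auto":
        mp_context = "spawn" if sys.platform == "win32" else "fork"
    return workers, mp.get_context(mp_context)
```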

Output file names, round-0 alignment, padding, and RIR / background mixing behave identically to the single-threaded path. Per-worker random state means the *exact* audio content differs between the two paths (different SNR draws, different RIR picks), but the output shape, count, and naming are identical, which is what the downstream feature extractor depends on.
12 changes: 12 additions & 0 deletions docs/evaluation.md
@@ -154,3 +154,15 @@ uv run livekit-wakeword eval configs/hey_livekit.yaml -m models/hey_livekit_oww.
```

This works because both livekit-wakeword and openWakeWord share the same frozen embedding front-end, producing identical `(16, 96)` feature matrices.

## ONNX Execution Providers

`EvalConfig.execution_providers` controls the ONNX Runtime providers used for the classifier inference session. Default `["CPUExecutionProvider"]` preserves existing behavior. On a GPU host with `onnxruntime-gpu` installed:

```yaml
eval:
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
  batch_size: 64  # default 1 is fine on CPU; bump on GPU
```

On CPU the classifier is small enough that `batch_size: 1` is rarely the bottleneck; on GPU the per-launch overhead dominates, so batching is required to see any speedup.
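To make the batching point concrete, here is a hedged sketch of a batched scoring loop; the model path, input-name handling, and feature shapes are assumptions, not the project's actual eval code:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "classifier.onnx",  # hypothetical path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name


def score(features: np.ndarray, batch_size: int = 64) -> np.ndarray:
    # One session.run per batch amortizes the per-launch overhead that
    # dominates on GPU at batch_size=1.
    out = []
    for i in range(0, len(features), batch_size):
        batch = features[i : i + batch_size].astype(np.float32)
        out.append(session.run(None, {input_name: batch})[0])
    return np.concatenate(out)
```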
18 changes: 18 additions & 0 deletions docs/feature-extraction.md
@@ -206,6 +206,24 @@ Only augmented clips (`clip_NNNNNN_rN.wav`) are processed — clean TTS original

Audio files are read via `soundfile`, converted to float32, reduced to mono if stereo, and processed one clip at a time.
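That load path amounts to a few lines of `soundfile` plus NumPy; a sketch under those assumptions, not the project's exact code:

```python
import numpy as np
import soundfile as sf


def load_clip(path: str) -> np.ndarray:
    audio, _sr = sf.read(path, dtype="float32")  # decode straight to float32
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # stereo -> mono
    return audio
```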

### Parallel Execution (`feature_extraction.n_workers`)

The per-clip mel + embedding loop in `extract_features_from_directory` is a pure Python `for` over two ONNX sessions. Under the GIL it pins to a single core; on a 32-CPU L40S container we measured ~3.5 clips/sec, which is ~4 h wall-clock for a 55k-clip dataset before training even starts.

`FeatureExtractionConfig.n_workers` opts into a `multiprocessing.Pool` that spreads the loop across worker processes. Each worker builds its own mel + embedding ONNX sessions via the pool's `initializer` (ORT sessions are not pickle-safe) and pins each session to a single intra-/inter-op thread; otherwise each worker would spin up its own full ORT thread pool, and `n_workers` × `os.cpu_count()` threads would thrash a multi-core host.

The pool uses `pool.imap` (ordered, not `imap_unordered`) so the per-clip order within a split is preserved — the downstream classifier training relies on consistent sample ordering.

```yaml
feature_extraction:
  n_workers: 0  # 0 = os.cpu_count(); 1 = single-threaded (default)
  execution_providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
```
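Putting the three constraints together (per-worker sessions, single-threaded ORT, ordered `imap`), a minimal sketch of the worker setup; the helper names and tensor shapes are assumptions, not the module's real internals:

```python
import multiprocessing as mp

import onnxruntime as ort
import soundfile as sf

_mel = None
_embed = None


def _init_worker(mel_path, embed_path, providers):
    # Sessions are built per worker (they are not pickle-safe) and pinned
    # to one thread each, so n_workers processes don't multiply into
    # n_workers x cpu_count ORT threads.
    global _mel, _embed
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    _mel = ort.InferenceSession(mel_path, sess_options=opts, providers=providers)
    _embed = ort.InferenceSession(embed_path, sess_options=opts, providers=providers)


def _extract_one(path):
    audio, _sr = sf.read(path, dtype="float32")
    mel_in = {_mel.get_inputs()[0].name: audio[None, :]}  # input shape is an assumption
    mel = _mel.run(None, mel_in)[0]
    return _embed.run(None, {_embed.get_inputs()[0].name: mel})[0]


def extract_parallel(paths, mel_path, embed_path, providers, n_workers):
    with mp.Pool(n_workers, initializer=_init_worker,
                 initargs=(mel_path, embed_path, providers)) as pool:
        # imap (not imap_unordered) preserves per-clip order in the output.
        return list(pool.imap(_extract_one, paths))
```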

### ONNX Execution Providers

`FeatureExtractionConfig.execution_providers` is plumbed through `MelSpectrogramFrontend.__init__` and `SpeechEmbedding.__init__`. The default `["CPUExecutionProvider"]` keeps existing behavior; on a GPU host with `onnxruntime-gpu` installed, setting `["CUDAExecutionProvider", "CPUExecutionProvider"]` offloads mel + embedding inference to the GPU with CPU as a fallback.
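Since ORT provider lists are priority-ordered, listing `CPUExecutionProvider` last gives a fallback when CUDA is unavailable. A defensive variant (not from the project source) can also filter the requested list against what is actually installed:

```python
import onnxruntime as ort


def pick_providers(requested: list[str]) -> list[str]:
    # Filtering up front avoids ORT's warning when CUDAExecutionProvider
    # is requested but onnxruntime-gpu isn't installed.
    available = set(ort.get_available_providers())
    return [p for p in requested if p in available] or ["CPUExecutionProvider"]
```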

## Memory-Mapped Dataset

**Source:** `src/livekit/wakeword/data/dataset.py`
54 changes: 53 additions & 1 deletion src/livekit/wakeword/config.py
@@ -5,7 +5,7 @@
import logging
from enum import StrEnum
from pathlib import Path
-from typing import Annotated, Self
+from typing import Annotated, Literal, Self

import yaml
from pydantic import BaseModel, Field, model_validator
@@ -58,6 +58,52 @@ class AugmentationConfig(BaseModel):
    background_paths: list[str] = Field(default_factory=lambda: ["./data/backgrounds"])
    rir_paths: list[str] = Field(default_factory=lambda: ["./data/rirs"])

    n_workers: int = 1
    """Number of parallel worker processes for the audio DSP loop.
    0 = auto (os.cpu_count()), 1 = single-threaded (legacy, default for
    backwards compatibility), N = explicit worker count."""

    mp_context: Literal["auto", "fork", "spawn", "forkserver"] = "auto"
    """Multiprocessing start method. "auto" picks 'fork' on Linux/macOS
    and 'spawn' on Windows. Override only if a fork-unsafe audio backend
    is crashing workers."""


class FeatureExtractionConfig(BaseModel):
    """Configuration for the feature-extraction stage (mel + embedding ONNX models)."""

    n_workers: int = 1
    """Number of parallel worker processes for the feature-extraction loop.
    0 = auto (os.cpu_count()), 1 = single-threaded (legacy, default for
    backwards compatibility), N = explicit worker count."""

    mp_context: Literal["auto", "fork", "spawn", "forkserver"] = "auto"
    """Multiprocessing start method. "auto" picks 'fork' on Linux/macOS
    and 'spawn' on Windows."""

    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"],
    )
    """ONNX Runtime execution providers, in priority order. Default preserves
    CPU-only behavior. Set to ["CUDAExecutionProvider", "CPUExecutionProvider"]
    on a GPU host to offload mel + embedding inference to the GPU (requires
    onnxruntime-gpu)."""


class EvalConfig(BaseModel):
    """Configuration for the eval stage (classifier ONNX inference)."""

    execution_providers: list[str] = Field(
        default_factory=lambda: ["CPUExecutionProvider"],
    )
    """ONNX Runtime execution providers, in priority order. Default preserves
    CPU-only behavior. Set to ["CUDAExecutionProvider", "CPUExecutionProvider"]
    on a GPU host (requires onnxruntime-gpu)."""

    batch_size: int = 1
    """Batch size for classifier inference. 1 is fine on CPU; bump to 64+
    when running on GPU to saturate the device."""


class ModelConfig(BaseModel):
    model_type: ModelType = ModelType.conv_attention
@@ -141,6 +187,12 @@ class WakeWordConfig(BaseModel):
    # Augmentation
    augmentation: AugmentationConfig = Field(default_factory=AugmentationConfig)

    # Feature extraction
    feature_extraction: FeatureExtractionConfig = Field(default_factory=FeatureExtractionConfig)

    # Evaluation
    eval: EvalConfig = Field(default_factory=EvalConfig)

    # Model
    model: ModelConfig = Field(default_factory=ModelConfig)