
feat: inject user-supplied recordings into positive_train via custom_positive_samples #69

Open
bnovik0v wants to merge 1 commit into livekit:main from bnovik0v:feat/custom-positive-samples

Conversation

@bnovik0v

Closes #60.

Summary

Adds a custom_positive_samples config field that lets users inject real WAV recordings of the target phrase into positive_train alongside the TTS-generated clips. Each source directory has a multiplier for oversampling.

```
custom_positive_samples:
  - path: ./data/my_recordings
    multiplier: 50
  - path: ./data/other_voices
    multiplier: 10
```
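A minimal sketch of the pydantic config model this maps to. The class and field names (`CustomPositiveSource`, `custom_positive_samples`, `multiplier: int ≥ 1`) come from the PR's file table; the exact field definitions and the shape of `WakeWordConfig` are assumptions for illustration:

```python
from pathlib import Path
from pydantic import BaseModel, Field


class CustomPositiveSource(BaseModel):
    """One directory of user-supplied WAV recordings plus an oversampling factor."""

    path: Path
    multiplier: int = Field(default=1, ge=1)  # each file is copied this many times


class WakeWordConfig(BaseModel):
    # ... other config fields elided for this sketch ...
    custom_positive_samples: list[CustomPositiveSource] = []
```

With this shape, the YAML above parses directly into the model, and `multiplier: 0` is rejected at config-load time rather than surfacing as an empty training split later.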

This covers the common case of biasing a wake word model toward a specific voice (yours, a customer's, a target demographic) without giving up the breadth of the Piper SLERP speaker pool or VoxCPM voice design.

Design note: multiplier vs. sampling weight

Your comment on #60 said "a custom positives folder and set a weight for that." I chose oversampling-by-duplication (multiplier) over sampling-weight (new batch class) for three reasons:

  1. The training sampler (mmap_batch_generator in data/dataset.py) cycles through each class's .npy deterministically — it isn't a WeightedRandomSampler. Adding proper sampling-weight would either require a new class in batch_n_per_class with its own feature file + trainer plumbing, or a full sampler rewrite. Both are much larger changes.
  2. Augmentation rounds interact naturally with duplication — each copy gets different RIR / background / EQ per round, so duplicates aren't wasted. A user with ~140 recordings × multiplier: 50 × augmentation.rounds: 2 ends up with ~14,000 unique augmented features.
  3. The user-facing mental model matches: "I want my voice to count 50× more" maps directly to a multiplier.

I'm happy to rework this to the sampling-weight shape if you'd prefer — the field is deliberately named multiplier (not weight) so a future sampling_weight can land alongside without breaking the API.
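The duplicate-count arithmetic in point 2 can be checked directly (numbers taken from the example above):

```python
recordings = 140           # user-supplied WAV clips
multiplier = 50            # copies per clip, from custom_positive_samples
augmentation_rounds = 2    # augmentation.rounds; each round draws fresh RIR/background/EQ

# every copy passes through every augmentation round independently
augmented_features = recordings * multiplier * augmentation_rounds
print(augmented_features)  # prints 14000
```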

What's in this PR

| File | Change |
| --- | --- |
| `src/livekit/wakeword/config.py` | New `CustomPositiveSource` pydantic model (path + `multiplier: int ≥ 1`); new `custom_positive_samples: list[CustomPositiveSource]` field on `WakeWordConfig`, default `[]` |
| `src/livekit/wakeword/data/generate.py` | New `_copy_custom_positives()` helper; 12-line call site in `run_generate` after the positive_train TTS loop |
| `tests/test_custom_positives.py` | 20 tests: pydantic validation, copy semantics, resume idempotency, error handling, YAML parsing, end-to-end integration with a faked TTS backend |
| `docs/data-generation.md` | New "Custom Positive Samples" section explaining config, augmentation interaction, and constraints |
| `configs/prod.yaml` | Commented-out example after `custom_negative_phrases` |

Key decisions flagged for review

  • Pipeline position: injection runs after positive_train TTS, before negatives. `start_index=config.n_samples` is pinned (not computed from the live clip count) so the layout is deterministic even when TTS OOM-skips clips; gaps in the numbering are harmless to the augmentation regex.
  • Hard-fail on missing path: differs from `background_paths` (silent skip). A typo in `custom_positive_samples.path` would silently train a worse model; failing fast is safer.
  • No auto-resample: non-16 kHz or stereo input raises `ValueError` with a `sox` one-liner in the message. Explicit > magic; users who need resampling can pre-convert.
  • Train only in v1: custom recordings don't enter `positive_test`. A follow-up PR can add an optional `split:` field and wire eval's `custom_recall` — that's a larger surface area (`eval/evaluate.py`, DET reporting) and worth its own discussion.
  • Non-.wav files in source dirs are warned about but ignored (matches the behavior of other glob-based paths in the pipeline).
  • Helper is `_` prefixed: treated as an internal implementation detail. Happy to rename if you want it public.

Verification

```
uv run ruff check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run ruff format --check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run mypy src/livekit/wakeword/ # same 1 pre-existing error as main (dup module in tts_constants.py)
uv run pytest tests/ # 80 passed (60 existing + 20 new)
```

Baseline ruff on main has 17 pre-existing errors in `dataset.py`, `onnx.py`, `listener.py`, `classifier.py`, and four test files — this PR adds zero new lints.

Context

Produced while migrating a custom-wake-word training pipeline from openWakeWord onto livekit-wakeword for a RunPod 3090 run. Before this PR we had an out-of-tree helper that renamed ~140 personal recordings into `clip_NNNNNN.wav` and dropped them into `positive_train/` after `generate`, continuing the numbering. This PR upstreams that pattern into the library itself so it's resume-safe, validated, and config-driven.

@CLAassistant

CLAassistant commented Apr 21, 2026

CLA assistant check
All committers have signed the CLA.

…positive_samples

Closes livekit#60.

Adds a `custom_positive_samples: list[CustomPositiveSource]` config field
that lets users inject real WAV recordings of the target phrase into
`positive_train` alongside the TTS-generated clips, with a `multiplier`
for oversampling.

  custom_positive_samples:
    - path: ./data/my_recordings
      multiplier: 50

Each file in each source is copied `multiplier` times into
`positive_train/` as `clip_NNNNNN.wav`, appended after the TTS clips
starting at `config.n_samples`. Copies enter the standard augmentation
pipeline so each gets different RIR/background/EQ per round. The helper
is resume-safe: existing output paths are skipped, so interrupted runs
pick up where they left off.

Validation: inputs must be 16 kHz mono; mismatches raise ValueError with
a sox one-liner. Missing source paths raise FileNotFoundError so typos
surface early instead of silently training a worse model.

Scope: train split only in v1. Held-out custom-voice eval (writing into
positive_test + reporting `custom_recall` from the evaluator) is worth a
separate PR since it touches eval/evaluate.py and DET reporting.
bnovik0v force-pushed the feat/custom-positive-samples branch from 803a25b to ba25fe5 on April 21, 2026 at 17:14


Development

Successfully merging this pull request may close these issues.

[feature] model fine tuning on user prepared samples
