feat: inject user-supplied recordings into positive_train via custom_positive_samples (#69)
Open

bnovik0v wants to merge 1 commit into livekit:main
Conversation
…positive_samples

Closes livekit#60. Adds a `custom_positive_samples: list[CustomPositiveSource]` config field that lets users inject real WAV recordings of the target phrase into `positive_train` alongside the TTS-generated clips, with a `multiplier` for oversampling:

```yaml
custom_positive_samples:
  - path: ./data/my_recordings
    multiplier: 50
```

Each file in each source is copied `multiplier` times into `positive_train/` as `clip_NNNNNN.wav`, appended after the TTS clips starting at `config.n_samples`. Copies enter the standard augmentation pipeline so each gets different RIR/background/EQ per round.

The helper is resume-safe: existing output paths are skipped, so interrupted runs pick up where they left off.

Validation: inputs must be 16 kHz mono; mismatches raise `ValueError` with a sox one-liner. Missing source paths raise `FileNotFoundError` so typos surface early instead of silently training a worse model.

Scope: train split only in v1. Held-out custom-voice eval (writing into `positive_test` + reporting `custom_recall` from the evaluator) is worth a separate PR since it touches `eval/evaluate.py` and DET reporting.
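The copy/validate loop described above can be sketched as follows. This is an illustrative reconstruction, not the actual livekit-wakeword code — the function name `inject_custom_positives` and its signature are hypothetical:

```python
# Hedged sketch of the behavior described in the commit message:
# resume-safe copying with 16 kHz mono validation. Illustrative only.
import shutil
import wave
from pathlib import Path


def inject_custom_positives(source_dir: Path, out_dir: Path,
                            multiplier: int, start_index: int) -> int:
    """Copy each WAV `multiplier` times into out_dir as clip_NNNNNN.wav,
    skipping outputs that already exist (resume-safe). Returns the next
    free clip index."""
    if not source_dir.exists():
        # surface typos early instead of silently training a worse model
        raise FileNotFoundError(f"custom positive source not found: {source_dir}")
    index = start_index
    for src in sorted(source_dir.glob("*.wav")):
        with wave.open(str(src), "rb") as w:
            if w.getframerate() != 16000 or w.getnchannels() != 1:
                raise ValueError(
                    f"{src} must be 16 kHz mono; convert with: "
                    f"sox {src} -r 16000 -c 1 fixed.wav"
                )
        for _ in range(multiplier):
            dst = out_dir / f"clip_{index:06d}.wav"
            if not dst.exists():  # resume-safe: skip already-written copies
                shutil.copy2(src, dst)
            index += 1
    return index
```

Because each copy is a distinct file on disk, the downstream augmentation rounds treat them as independent clips, which is what produces varied RIR/background/EQ per copy.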
Closes #60.
Summary
Adds a `custom_positive_samples` config field that lets users inject real WAV recordings of the target phrase into `positive_train` alongside the TTS-generated clips. Each source directory has a `multiplier` for oversampling.

This covers the common case of biasing a wake word model toward a specific voice (yours, a customer's, a target demographic) without giving up the breadth of the Piper SLERP speaker pool or VoxCPM voice design.
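A minimal sketch of the config shape implied by the description above. The field names follow this PR; the dataclasses themselves are illustrative, not the actual livekit-wakeword config classes:

```python
# Illustrative config shape, assuming a dataclass-style config.
from dataclasses import dataclass, field


@dataclass
class CustomPositiveSource:
    path: str            # directory of 16 kHz mono WAV recordings
    multiplier: int = 1  # copies of each file injected into positive_train/


@dataclass
class TrainingConfig:    # hypothetical container; the real config has more fields
    n_samples: int = 5000  # TTS clips; custom copies are numbered after this
    custom_positive_samples: list[CustomPositiveSource] = field(default_factory=list)


cfg = TrainingConfig(
    custom_positive_samples=[
        CustomPositiveSource(path="./data/my_recordings", multiplier=50),
    ]
)
```

Defaulting `custom_positive_samples` to an empty list keeps the feature fully opt-in: existing configs are unaffected.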
Design note: multiplier vs. sampling weight
Your comment on #60 said "a custom positives folder and set a weight for that." I chose oversampling-by-duplication (multiplier) over sampling-weight (new batch class) for three reasons:
1. `mmap_batch_generator` (in `data/dataset.py`) cycles through each class's `.npy` deterministically — it isn't a `WeightedRandomSampler`. Adding proper sampling-weight would either require a new class in `batch_n_per_class` with its own feature file + trainer plumbing, or a full sampler rewrite. Both are much larger changes.
2. `multiplier: 50` × `augmentation.rounds: 2` ends up with ~14,000 unique augmented features.

I'm happy to rework this to the sampling-weight shape if you'd prefer — the field is deliberately named `multiplier` (not `weight`) so a future `sampling_weight` can land alongside without breaking the API.

What's in this PR
Key decisions flagged for review
Verification
```
uv run ruff check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run ruff format --check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run mypy src/livekit/wakeword/ # same 1 pre-existing error as main (dup module in tts_constants.py)
uv run pytest tests/ # 80 passed (60 existing + 20 new)
```
Baseline ruff on main has 17 pre-existing errors in `dataset.py`, `onnx.py`, `listener.py`, `classifier.py`, and four test files — this PR adds zero new lints.
Context
Produced while migrating a custom-wake-word training pipeline from openWakeWord onto livekit-wakeword for a RunPod 3090 run. Before this PR we had an out-of-tree helper that renamed ~140 personal recordings into `clip_NNNNNN.wav` and dropped them into `positive_train/` after `generate`, continuing the numbering. This PR upstreams that pattern into the library itself so it's resume-safe, validated, and config-driven.
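The out-of-tree helper mentioned above amounted to a few lines. A hedged reconstruction of that pattern — the function name and the exact numbering logic are assumptions, not the original script:

```python
# Sketch of the pre-PR out-of-tree workflow: copy personal recordings into
# positive_train/ as clip_NNNNNN.wav, continuing the existing numbering.
import shutil
from pathlib import Path


def append_recordings(recordings_dir: Path, positive_train: Path) -> int:
    """Copy WAVs into positive_train/ with continued clip numbering.
    Returns the number of recordings copied."""
    existing = sorted(positive_train.glob("clip_*.wav"))
    # continue after the highest existing clip index, or start at 0
    start = int(existing[-1].stem.split("_")[1]) + 1 if existing else 0
    srcs = sorted(recordings_dir.glob("*.wav"))
    for i, src in enumerate(srcs, start=start):
        shutil.copy2(src, positive_train / f"clip_{i:06d}.wav")
    return len(srcs)
```

Unlike this PR's in-library version, a script like this has no validation, no multiplier, and no resume safety — which is the gap the PR closes.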