
feat: inject user-supplied recordings into positive_train via custom_positive_samples #69

Open
bnovik0v wants to merge 1 commit into livekit:main from bnovik0v:feat/custom-positive-samples

Conversation

@bnovik0v

Closes #60.

Summary

Adds a custom_positive_samples config field that lets users inject real WAV recordings of the target phrase into positive_train alongside the TTS-generated clips. Each source directory has a multiplier for oversampling.

```
custom_positive_samples:
  - path: ./data/my_recordings
    multiplier: 50
  - path: ./data/other_voices
    multiplier: 10
```
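A minimal sketch of the pydantic config model this maps to. The class and field names (`CustomPositiveSource`, `custom_positive_samples`, `multiplier: int ≥ 1`) come from the PR's file table; the exact field definitions and the shape of `WakeWordConfig` are assumptions for illustration:

```python
from pathlib import Path
from pydantic import BaseModel, Field


class CustomPositiveSource(BaseModel):
    """One directory of user-supplied WAV recordings plus an oversampling factor."""

    path: Path
    multiplier: int = Field(default=1, ge=1)  # each file is copied this many times


class WakeWordConfig(BaseModel):
    # ... other config fields elided for this sketch ...
    custom_positive_samples: list[CustomPositiveSource] = []
```

With this shape, the YAML above parses directly into the model, and `multiplier: 0` is rejected at config-load time rather than surfacing as an empty training split later.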

This covers the common case of biasing a wake word model toward a specific voice (yours, a customer's, a target demographic) without giving up the breadth of the Piper SLERP speaker pool or VoxCPM voice design.

Design note: multiplier vs. sampling weight

Your comment on #60 said "a custom positives folder and set a weight for that." I chose oversampling-by-duplication (multiplier) over sampling-weight (new batch class) for three reasons:

  1. The training sampler (mmap_batch_generator in data/dataset.py) cycles through each class's .npy deterministically — it isn't a WeightedRandomSampler. Adding proper sampling-weight would either require a new class in batch_n_per_class with its own feature file + trainer plumbing, or a full sampler rewrite. Both are much larger changes.
  2. Augmentation rounds interact naturally with duplication — each copy gets different RIR / background / EQ per round, so duplicates aren't wasted. A user with ~140 recordings × multiplier: 50 × augmentation.rounds: 2 ends up with ~14,000 unique augmented features.
  3. The user-facing mental model matches: "I want my voice to count 50× more" maps directly to a multiplier.

I'm happy to rework this to the sampling-weight shape if you'd prefer — the field is deliberately named multiplier (not weight) so a future sampling_weight can land alongside without breaking the API.
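The duplicate-count arithmetic in point 2 can be checked directly (numbers taken from the example above):

```python
recordings = 140           # user-supplied WAV clips
multiplier = 50            # copies per clip, from custom_positive_samples
augmentation_rounds = 2    # augmentation.rounds; each round draws fresh RIR/background/EQ

# every copy passes through every augmentation round independently
augmented_features = recordings * multiplier * augmentation_rounds
print(augmented_features)  # prints 14000
```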

What's in this PR

| File | Change |
| --- | --- |
| `src/livekit/wakeword/config.py` | New `CustomPositiveSource` pydantic model (path + `multiplier: int ≥ 1`); new `custom_positive_samples: list[CustomPositiveSource]` field on `WakeWordConfig`, default `[]` |
| `src/livekit/wakeword/data/generate.py` | New `_copy_custom_positives()` helper; 12-line call site in `run_generate` after the positive_train TTS loop |
| `tests/test_custom_positives.py` | 20 tests: pydantic validation, copy semantics, resume idempotency, error handling, YAML parsing, end-to-end integration with a faked TTS backend |
| `docs/data-generation.md` | New "Custom Positive Samples" section explaining config, augmentation interaction, and constraints |
| `configs/prod.yaml` | Commented-out example after `custom_negative_phrases` |

Key decisions flagged for review

  • Pipeline position: injection runs after positive_train TTS, before negatives. `start_index=config.n_samples` is pinned (not computed from the live clip count) so the layout is deterministic even when TTS OOM-skips clips; gaps in the numbering are harmless to the augmentation regex.
  • Hard-fail on missing path: differs from `background_paths` (silent skip). A typo in `custom_positive_samples.path` would silently train a worse model; failing fast is safer.
  • No auto-resample: non-16 kHz or stereo input raises `ValueError` with a `sox` one-liner in the message. Explicit > magic; users who need resampling can pre-convert.
  • Train only in v1: custom recordings don't enter `positive_test`. A follow-up PR can add an optional `split:` field and wire eval's `custom_recall` — that's a larger surface area (`eval/evaluate.py`, DET reporting) and worth its own discussion.
  • Non-.wav files in source dirs are warned about but ignored (matches the behavior of other glob-based paths in the pipeline).
  • Helper is `_` prefixed: treated as an internal implementation detail. Happy to rename if you want it public.

Verification

```
uv run ruff check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run ruff format --check src/livekit/wakeword/config.py src/livekit/wakeword/data/generate.py tests/test_custom_positives.py # clean
uv run mypy src/livekit/wakeword/ # same 1 pre-existing error as main (dup module in tts_constants.py)
uv run pytest tests/ # 80 passed (60 existing + 20 new)
```

Baseline ruff on main has 17 pre-existing errors in `dataset.py`, `onnx.py`, `listener.py`, `classifier.py`, and four test files — this PR adds zero new lints.

Context

Produced while migrating a custom-wake-word training pipeline from openWakeWord onto livekit-wakeword for a RunPod 3090 run. Before this PR we had an out-of-tree helper that renamed ~140 personal recordings into `clip_NNNNNN.wav` and dropped them into `positive_train/` after `generate`, continuing the numbering. This PR upstreams that pattern into the library itself so it's resume-safe, validated, and config-driven.

@CLAassistant

CLAassistant commented Apr 21, 2026

CLA assistant check
All committers have signed the CLA.

…positive_samples

Closes livekit#60.

Adds a `custom_positive_samples: list[CustomPositiveSource]` config field
that lets users inject real WAV recordings of the target phrase into
`positive_train` alongside the TTS-generated clips, with a `multiplier`
for oversampling.

  custom_positive_samples:
    - path: ./data/my_recordings
      multiplier: 50

Each file in each source is copied `multiplier` times into
`positive_train/` as `clip_NNNNNN.wav`, appended after the TTS clips
starting at `config.n_samples`. Copies enter the standard augmentation
pipeline so each gets different RIR/background/EQ per round. The helper
is resume-safe: existing output paths are skipped, so interrupted runs
pick up where they left off.

Validation: inputs must be 16 kHz mono; mismatches raise ValueError with
a sox one-liner. Missing source paths raise FileNotFoundError so typos
surface early instead of silently training a worse model.

Scope: train split only in v1. Held-out custom-voice eval (writing into
positive_test + reporting `custom_recall` from the evaluator) is worth a
separate PR since it touches eval/evaluate.py and DET reporting.
bnovik0v force-pushed the feat/custom-positive-samples branch from 803a25b to ba25fe5 on April 21, 2026 at 17:14


Development

Successfully merging this pull request may close these issues.

[feature] model fine tuning on user prepared samples
