Merged
36 commits
fe14c69
docs: add training data pipeline design doc
wavekat-eason Apr 17, 2026
334e583
docs: use OpenRouter for LLM and Qwen3-TTS VoiceDesign for TTS
wavekat-eason Apr 17, 2026
c2e684f
docs: LLM generates TTS instructs per sample
wavekat-eason Apr 18, 2026
6d58b9e
docs: define incomplete turn patterns and resolve open questions
wavekat-eason Apr 18, 2026
7322f30
docs: reorganize training docs and add dataset research
wavekat-eason Apr 18, 2026
32f4b28
feat: add ASR transcription notebook with Paraformer-zh
wavekat-eason Apr 19, 2026
64ea65d
chore: ignore ASR results output file
wavekat-eason Apr 19, 2026
a94f80d
feat: add MPS support and speed benchmarks to ASR notebook
wavekat-eason Apr 19, 2026
99453bd
feat: add Silero VAD notebook and VAD comparison doc
wavekat-eason Apr 19, 2026
ab73047
feat: raw VAD probs, visualization, make install
wavekat-eason Apr 19, 2026
778a3b8
docs: add data structures reference for ASR and VAD results
wavekat-eason Apr 19, 2026
65506ac
feat: save ASR results as per-file JSON
wavekat-eason Apr 19, 2026
86eed0c
docs: update data structures for per-file ASR output
wavekat-eason Apr 19, 2026
8cfc663
docs: add audio viewer plan
wavekat-eason Apr 19, 2026
e89af50
feat: add audio viewer (Phases 1-3)
wavekat-eason Apr 19, 2026
b3b8c10
feat: viewer char-level timing, segments, dB scale, multi-file open
wavekat-eason Apr 19, 2026
c997faf
feat: dB default, dB grid lines, LOD-aligned zoom presets
wavekat-eason Apr 20, 2026
e72be58
feat: volume control, gap highlighting, char label splits
wavekat-eason Apr 20, 2026
add1f9f
chore: gitignore nested wav, notebook outputs
wavekat-eason Apr 20, 2026
7f49034
fix: remove file type filter from Open files picker
wavekat-eason Apr 20, 2026
6e23612
chore: update notebook execution outputs
wavekat-eason Apr 20, 2026
f04132f
feat: add notebook to group wav/vad/asr for viewer
wavekat-eason Apr 20, 2026
df559ef
refactor: migrate viewer to React
wavekat-eason Apr 20, 2026
9e32c88
feat: add spectrogram to audio viewer
wavekat-eason Apr 20, 2026
8e0d665
feat: add resizable panels and right-side ASR layout
wavekat-eason Apr 20, 2026
58590b9
feat: polish viewer UI and add zoom buttons
wavekat-eason Apr 20, 2026
cb3a050
feat: highlight ASR sentence at cursor and slow scroll
wavekat-eason Apr 20, 2026
408adf2
fix: reduce scroll zoom sensitivity
wavekat-eason Apr 20, 2026
564a71a
feat: add segment playback on ASR timestamp hover
wavekat-eason Apr 20, 2026
8c8b01d
feat: add direct waveform rendering and view span label
wavekat-eason Apr 20, 2026
f7eff17
feat: add loop range selection and repeat play
wavekat-eason Apr 20, 2026
4a20d35
feat: show loop duration in toolbar
wavekat-eason Apr 20, 2026
3299f8f
feat: add VAD block navigation and shortcuts dialog
wavekat-eason Apr 20, 2026
74abde4
feat: add ASR sentence navigation with { / }
wavekat-eason Apr 20, 2026
cbd405f
fix: only pan view on nav jump when target is off-screen
wavekat-eason Apr 20, 2026
4d5c6e3
chore: add Google Analytics tracking
wavekat-eason Apr 20, 2026
22 changes: 22 additions & 0 deletions .gitignore
@@ -14,3 +14,25 @@ __pycache__/

# Generated mel reference tensors (regenerate with scripts/gen_reference.py)
*.mel.npy

# Jupyter checkpoints
.ipynb_checkpoints/

# Training venvs
training/**/.venv/

# Training data
training/smart-turn-zh/data/wav/*.wav
training/smart-turn-zh/data/**/*.wav
training/smart-turn-zh/data/*.jsonl
training/smart-turn-zh/data/vad_probs/
training/smart-turn-zh/data/asr_results/
training/smart-turn-zh/data/example/
training/smart-turn-zh/data/grouped/

# Training references
training/smart-turn-zh/refs/*.pdf

# Viewer
training/smart-turn-zh/viewer/node_modules/
training/smart-turn-zh/viewer/dist/
File renamed without changes.
File renamed without changes.
File renamed without changes.
36 changes: 36 additions & 0 deletions training/smart-turn-zh/Makefile
@@ -0,0 +1,36 @@
VENV := .venv
PYTHON := $(VENV)/bin/python
PIP := $(VENV)/bin/pip

.PHONY: help venv install notebook viewer clean

help:
	@echo "Available targets:"
	@echo " venv Create virtualenv and install dependencies"
	@echo " install Re-install dependencies into existing venv"
	@echo " notebook Launch Jupyter notebook server"
	@echo " viewer Launch audio viewer dev server"
	@echo " clean Remove virtualenv"

venv: $(VENV)/bin/activate

$(VENV)/bin/activate: requirements.txt
	python3 -m venv $(VENV)
	$(PIP) install --upgrade pip
	$(PIP) install -r requirements.txt
	touch $(VENV)/bin/activate

install: venv
	$(PIP) install -r requirements.txt

notebook: venv
	$(VENV)/bin/jupyter lab notebooks/

viewer: viewer/node_modules
	cd viewer && npm run dev

viewer/node_modules: viewer/package.json
	cd viewer && npm install

clean:
	rm -rf $(VENV)
20 changes: 20 additions & 0 deletions training/smart-turn-zh/README.md
@@ -0,0 +1,20 @@
# Smart Turn — Mandarin (smart-turn-zh)

Mandarin-Chinese variant of the Smart Turn detector: fine-tuning the upstream
[Pipecat Smart Turn](../pipecat-smart-turn/) architecture (Whisper-Tiny encoder
+ binary classification head) on Chinese conversational audio.

## Layout

- [`plan-data.md`](plan-data.md) — dataset construction plan (under revision)
- [`research/`](research/) — surveys and open questions feeding the plans
  - [`01-datasets.md`](research/01-datasets.md) — OpenSLR + HuggingFace dataset survey
- `data/` — data pipeline scripts
- `notebooks/` — Jupyter exploration
- `train/` — training scripts (later)

## Status

Design phase. The data pipeline plan in `plan-data.md` describes the original
LLM-rewriting + TTS approach; we are revising it toward real conversational
corpora. See `research/01-datasets.md` for the source-corpus analysis.
Empty file.
76 changes: 76 additions & 0 deletions training/smart-turn-zh/docs/01-vad-comparison.md
@@ -0,0 +1,76 @@
# VAD Model Comparison for Chinese Filler Detection

## Use Case

Detect speech regions in Chinese podcast audio, including soft fillers
(呃/嗯/啊) that ASR typically skips. The goal: **VAD active ∧ ASR silent =
filler candidates** (PodcastFillers, Zhu et al. 2022).

Key requirements:
- Catches soft/quiet speech (fillers are often quiet)
- Configurable activation threshold (paper found 0.1 critical)
- Fine temporal resolution (10ms ideal)
- Runs on MacBook Pro (Apple Silicon)

## Comparison

| | FSMN-VAD | Silero VAD | WebRTC VAD | pyannote.audio | TEN VAD |
|---|---|---|---|---|---|
| **Source** | Alibaba / FunASR | Silero | Google | pyannote | TEN Framework |
| **Params / size** | 0.4M | ~70KB (quantized) | <1MB compiled | ~68M | lightweight ONNX |
| **Resolution** | 200ms chunks | 10ms+ configurable | 10/20/30ms fixed | 10ms | 10–16ms |
| **Threshold** | Configurable (speech_noise_thres) | Probability output, fully tunable | Aggressiveness 0–3 (coarse) | Probability output, fully tunable | Configurable, default 0.5 |
| **Chinese** | Native (5000h Mandarin) | General (6000+ langs, no zh-specific) | Language-agnostic | Trained on AISHELL + AliMeeting | General |
| **Soft fillers** | Good, but 200ms chunks may blur boundaries | Catches at low threshold (~0.1–0.3) | May miss quiet fillers | Best accuracy on soft boundaries | Less documented |
| **Mac perf** | CPU, fast for 0.4M | ~40μs/chunk on M2 Max | Very fast CPU | Slow on CPU (68M params) | arm64 native + ONNX |
| **MPS/GPU** | Yes (via FunASR) | MLX native | N/A (CPU only) | MPS supported | ONNX only |
| **License** | Model-specific (check HF) | MIT | BSD | MIT | Apache 2.0 |

## Analysis

### FSMN-VAD
- **Pro**: Already in our stack (FunASR). Chinese-native, production-proven.
- **Con**: 200ms chunk size is coarser than ideal, so the boundaries of
short fillers (150–400ms) may be imprecise.

### Silero VAD
- **Pro**: Tiny, fast, 10ms resolution, MIT license. Threshold tunable to
~0.1 for soft speech. MLX native on Apple Silicon.
- **Con**: Not Chinese-optimized. Lightweight design means less sophisticated
boundary detection.

### WebRTC VAD
- **Pro**: Mature, fast, 10ms native resolution.
- **Con**: No probability output — binary decisions with coarse aggressiveness
levels (0–3). Hard to tune for soft fillers. No fine-grained threshold.

### pyannote.audio (segmentation-3.0)
- **Pro**: Best accuracy. 10ms resolution. Trained on Chinese datasets
(AISHELL, AliMeeting). Best at catching soft speech boundaries.
- **Con**: 68M params — slow on CPU. Overkill if we only need binary VAD.

### TEN VAD
- **Pro**: 10ms resolution; the project reports higher precision than WebRTC and Silero. Apache 2.0.
- **Con**: Newer (2024–2025), fewer production deployments. Less documented
for Chinese/soft speech.

## Recommendation

**Silero VAD** as primary choice:
- 10ms resolution matches what the paper uses
- Threshold tunable to 0.1 (critical finding from the paper)
- Tiny and fast on MacBook
- MIT license
- Good enough for candidate generation — the classifier stage handles precision

**FSMN-VAD** as comparison baseline since it's already in our pipeline.

If accuracy on soft boundaries proves insufficient, upgrade to **pyannote.audio**.

## References

- [PodcastFillers paper](../refs/2203.pdf) — Section 2.2: VAD threshold 0.1, candidates 150ms–2s
- [Silero VAD](https://github.com/snakers4/silero-vad)
- [FSMN-VAD](https://huggingface.co/funasr/fsmn-vad)
- [pyannote.audio](https://github.com/pyannote/pyannote-audio)
- [TEN VAD](https://github.com/TEN-framework/ten-vad)
114 changes: 114 additions & 0 deletions training/smart-turn-zh/docs/02-data-structures.md
@@ -0,0 +1,114 @@
# Data Structures

Artifacts produced by the notebook pipeline (`01-asr-transcribe`, `02-vad`) and their schemas.

## Directory Layout

```
data/
├── wav/                          # Source audio
│   ├── R8001_M8004_MS801.wav     # 8-ch, 16 kHz, PCM-16
│   └── R8003_M8001_MS801.wav
├── asr_results/                  # Per-file ASR transcriptions
│   ├── R8001_M8004_MS801.json
│   └── R8003_M8001_MS801.json
└── vad_probs/                    # Per-frame speech probabilities
    ├── R8001_M8004_MS801.npy
    └── R8003_M8001_MS801.npy
```

## Source Audio (`data/wav/*.wav`)

| Property | Value |
|-------------|--------------------------------|
| Format | RIFF WAVE, PCM 16-bit |
| Channels | 8 (per-speaker headset mics) |
| Sample rate | 16 kHz |
| Source | AliMeeting (SLR-119) meetings |

## ASR Results (`data/asr_results/*.json`)

**Format**: JSON — one file per WAV, named `{wav_stem}.json`.
**Producer**: `notebooks/01-asr-transcribe.ipynb` (Paraformer-zh + FSMN-VAD + ct-punc).

### File Schema

Each file contains a JSON array of record objects:

```jsonc
[
  {
    "text": "全文转写结果...",          // full transcription (punctuated)
    "sentences": [ /* see below */ ],
    "timestamp": [ /* see below */ ]
  }
]
```

### `sentences` Array Element

Each element is one sentence/chunk segmented by the ASR model.

```jsonc
{
  "text": "啊,",                  // punctuated text
  "raw_text": "啊",               // text without punctuation
  "start": 7130,                  // start time (ms)
  "end": 7370,                    // end time (ms)
  "timestamp": [[7130, 7370]]     // per-word [start_ms, end_ms] pairs
}
```

| Field | Type | Unit | Description |
|-------------|------------------|------|------------------------------------------|
| `text` | string | — | Sentence with restored punctuation |
| `raw_text` | string | — | Same sentence, no punctuation |
| `start` | int | ms | Sentence start time |
| `end` | int | ms | Sentence end time |
| `timestamp` | array of [int, int] | ms | Per-word start/end pairs (10 ms frames) |

### Top-level `timestamp` Array

Flat array of all per-word `[start_ms, end_ms]` pairs across the entire file (same data as the union of per-sentence timestamps).
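
As a sanity check on the relationship described above, the following minimal sketch rebuilds the flat list from the per-sentence entries and compares it with the top-level `timestamp` field. The helper name `flatten_sentence_timestamps` and the sample record are hypothetical; the second sentence is made up for illustration and only follows the schema documented here.

```python
import json


def flatten_sentence_timestamps(record):
    """Concatenate per-sentence [start_ms, end_ms] pairs in sentence order."""
    pairs = []
    for sentence in record["sentences"]:
        pairs.extend(sentence["timestamp"])
    return pairs


# Hand-made record following the documented schema (illustrative values only).
record = json.loads("""
{
  "text": "啊,好的。",
  "sentences": [
    {"text": "啊,", "raw_text": "啊", "start": 7130, "end": 7370,
     "timestamp": [[7130, 7370]]},
    {"text": "好的。", "raw_text": "好的", "start": 7900, "end": 8300,
     "timestamp": [[7900, 8100], [8100, 8300]]}
  ],
  "timestamp": [[7130, 7370], [7900, 8100], [8100, 8300]]
}
""")

flat = flatten_sentence_timestamps(record)
# True when the top-level array equals the concatenated per-sentence pairs.
matches_top_level = flat == record["timestamp"]
```

On real output, a mismatch would suggest that the two fields are not simple copies of each other and that the description above needs revisiting.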

## VAD Probabilities (`data/vad_probs/*.npy`)

**Format**: NumPy `.npy`, 1-D `float32` array.
**Producer**: `notebooks/02-vad.ipynb` (Silero VAD).

| Property | Value |
|-------------------|-------------------------------|
| Shape | `(num_frames,)` |
| Dtype | `float32` |
| Frame size | 512 samples = **32 ms** @ 16 kHz |
| Value range | `[0.0, 1.0]` — P(speech) |
| Index → time | `frame[i]` → `i * 32 ms` |

### Loading

```python
import numpy as np
probs = np.load("data/vad_probs/R8001_M8004_MS801.npy")
# probs[i] = speech probability at time i * 32 ms
```

### File Details

| File | Frames | Duration |
|---------------------------|---------|------------|
| `R8001_M8004_MS801.npy` | 49,183 | ~1,573.9 s |
| `R8003_M8001_MS801.npy` | 64,625 | ~2,068.0 s |
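
The durations in this table follow directly from the frame size: each frame covers 512 samples at 16 kHz. A short sketch of that arithmetic, where the constant and function names are illustrative rather than taken from the notebooks:

```python
FRAME_SAMPLES = 512   # samples per VAD frame, from the table above
SAMPLE_RATE = 16_000  # Hz


def frames_to_seconds(num_frames):
    """Duration covered by num_frames consecutive VAD frames, in seconds."""
    return num_frames * FRAME_SAMPLES / SAMPLE_RATE


# Frame counts copied from the File Details table.
durations = {
    "R8001_M8004_MS801.npy": frames_to_seconds(49_183),
    "R8003_M8001_MS801.npy": frames_to_seconds(64_625),
}
```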

## Cross-referencing ASR and VAD

ASR timestamps are in **milliseconds**; VAD frames are **32 ms** each.

```python
# Convert ASR ms timestamp to VAD frame index
vad_frame = asr_start_ms // 32

# Convert VAD frame index to ms
time_ms = vad_frame * 32
```

This alignment is used in the next pipeline step (filler candidate extraction): regions where VAD is active (`prob > threshold`) but ASR produces no recognized words.
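
A minimal sketch of what that extraction step could look like, assuming the 0.1 threshold and the 150 ms to 2 s duration window cited from the PodcastFillers paper in `01-vad-comparison.md`. The function names are hypothetical and this is not the notebook's actual implementation. `probs` can be any sequence of floats, including the 1-D array returned by `np.load`; `words` is a list of `[start_ms, end_ms]` pairs such as the top-level `timestamp` array.

```python
FRAME_MS = 32  # one VAD frame = 512 samples at 16 kHz


def active_regions(probs, threshold=0.1):
    """Return (start_ms, end_ms) for each run of frames with prob > threshold."""
    regions = []
    run_start = None
    for i, p in enumerate(probs):
        if p > threshold and run_start is None:
            run_start = i
        elif p <= threshold and run_start is not None:
            regions.append((run_start * FRAME_MS, i * FRAME_MS))
            run_start = None
    if run_start is not None:  # run extends to the end of the file
        regions.append((run_start * FRAME_MS, len(probs) * FRAME_MS))
    return regions


def subtract_words(region, words):
    """Return the parts of region not covered by any ASR word interval."""
    start, end = region
    gaps = []
    cursor = start
    for w_start, w_end in sorted(words):
        if w_end <= cursor or w_start >= end:
            continue  # word does not overlap the remaining part of the region
        if w_start > cursor:
            gaps.append((cursor, w_start))
        cursor = max(cursor, w_end)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def filler_candidates(probs, words, threshold=0.1, min_ms=150, max_ms=2000):
    """VAD-active spans with no ASR word, kept if min_ms <= duration <= max_ms."""
    candidates = []
    for region in active_regions(probs, threshold):
        for g_start, g_end in subtract_words(region, words):
            if min_ms <= g_end - g_start <= max_ms:
                candidates.append((g_start, g_end))
    return candidates


# Synthetic example: two active runs, only the first is covered by an ASR word.
probs = [0.0] * 5 + [0.6] * 10 + [0.0] * 5 + [0.3] * 8 + [0.0] * 2
words = [[160, 480]]
regions = active_regions(probs)
candidates = filler_candidates(probs, words)
```

In this synthetic input the first active run (160 to 480 ms) is fully explained by an ASR word, so only the second run (640 to 896 ms) survives as a candidate. Boundaries are quantized to the 32 ms frame grid, so candidate edges can be off by up to one frame relative to the ASR timestamps.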