wavekat · wavekat-eason · May 10, 2026 · Apr 17, 2026 · Apr 17, 2026 · Apr 18, 2026
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,25 @@ __pycache__/
 
 # Generated mel reference tensors (regenerate with scripts/gen_reference.py)
 *.mel.npy
+
+# Jupyter checkpoints
+.ipynb_checkpoints/
+
+# Training venvs
+training/**/.venv/
+
+# Training data
+training/smart-turn-zh/data/wav/*.wav
+training/smart-turn-zh/data/**/*.wav
+training/smart-turn-zh/data/*.jsonl
+training/smart-turn-zh/data/vad_probs/
+training/smart-turn-zh/data/asr_results/
+training/smart-turn-zh/data/example/
+training/smart-turn-zh/data/grouped/
+
+# Training references
+training/smart-turn-zh/refs/*.pdf
+
+# Viewer
+training/smart-turn-zh/viewer/node_modules/
+training/smart-turn-zh/viewer/dist/
diff --git a/docs/plan-accuracy.md → docs/01-plan-accuracy.md b/docs/plan-accuracy.md → docs/01-plan-accuracy.md
diff --git a/docs/plan-backends.md → docs/02-plan-backends.md b/docs/plan-backends.md → docs/02-plan-backends.md
diff --git a/docs/plan-turn-controller.md → docs/03-plan-turn-controller.md b/docs/plan-turn-controller.md → docs/03-plan-turn-controller.md
diff --git a/training/smart-turn-zh/Makefile b/training/smart-turn-zh/Makefile
@@ -0,0 +1,36 @@
+VENV := .venv
+PYTHON := $(VENV)/bin/python
+PIP := $(VENV)/bin/pip
+
+.PHONY: help venv install notebook viewer clean
+
+help:
+	@echo "Available targets:"
+	@echo "  venv       Create virtualenv and install dependencies"
+	@echo "  install    Re-install dependencies into existing venv"
+	@echo "  notebook   Launch Jupyter notebook server"
+	@echo "  viewer     Launch audio viewer dev server"
+	@echo "  clean      Remove virtualenv"
+
+venv: $(VENV)/bin/activate
+
+$(VENV)/bin/activate: requirements.txt
+	python3 -m venv $(VENV)
+	$(PIP) install --upgrade pip
+	$(PIP) install -r requirements.txt
+	touch $(VENV)/bin/activate
+
+install: venv
+	$(PIP) install -r requirements.txt
+
+notebook: venv
+	$(VENV)/bin/jupyter lab notebooks/
+
+viewer: viewer/node_modules
+	cd viewer && npm run dev
+
+viewer/node_modules: viewer/package.json
+	cd viewer && npm install
+
+clean:
+	rm -rf $(VENV)
diff --git a/training/smart-turn-zh/README.md b/training/smart-turn-zh/README.md
@@ -0,0 +1,20 @@
+# Smart Turn — Mandarin (smart-turn-zh)
+
+Mandarin-Chinese variant of the Smart Turn detector: fine-tuning the upstream
+[Pipecat Smart Turn](../pipecat-smart-turn/) architecture (Whisper-Tiny encoder
++ binary classification head) on Chinese conversational audio.
+
+## Layout
+
+- [`plan-data.md`](plan-data.md) — dataset construction plan (under revision)
+- [`research/`](research/) — surveys and open questions feeding the plans
+  - [`01-datasets.md`](research/01-datasets.md) — OpenSLR + HuggingFace dataset survey
+- `data/` — data pipeline scripts
+- `notebooks/` — Jupyter exploration
+- `train/` — training scripts (later)
+
+## Status
+
+Design phase. The data pipeline plan in `plan-data.md` describes the original
+LLM-rewriting + TTS approach; we are revising it toward real conversational
+corpora (see `research/01-datasets.md` for the source-corpus analysis).
diff --git a/training/smart-turn-zh/data/wav/.gitkeep b/training/smart-turn-zh/data/wav/.gitkeep
diff --git a/training/smart-turn-zh/docs/01-vad-comparison.md b/training/smart-turn-zh/docs/01-vad-comparison.md
@@ -0,0 +1,76 @@
+# VAD Model Comparison for Chinese Filler Detection
+
+## Use Case
+
+Detect speech regions in Chinese podcast audio, including soft fillers
+(呃/嗯/啊) that ASR typically skips. The goal: **VAD active ∧ ASR silent =
+filler candidates** (PodcastFillers, Zhu et al. 2022).
+
+Key requirements:
+- Catches soft/quiet speech (fillers are often quiet)
+- Configurable activation threshold (paper found 0.1 critical)
+- Fine temporal resolution (10ms ideal)
+- Runs on MacBook Pro (Apple Silicon)
+
+## Comparison
+
+| | FSMN-VAD | Silero VAD | WebRTC VAD | pyannote.audio | TEN VAD |
+|---|---|---|---|---|---|
+| **Source** | Alibaba / FunASR | Silero | Google | pyannote | TEN Framework |
+| **Size** | 0.4M params | ~70KB (quantized) | <1MB compiled | ~68M params | lightweight ONNX |
+| **Resolution** | 200ms chunks | 32ms frames (512 samples @ 16kHz) | 10/20/30ms fixed | 10ms | 10–16ms |
+| **Threshold** | Configurable (speech_noise_thres) | Probability output, fully tunable | Aggressiveness 0–3 (coarse) | Probability output, fully tunable | Configurable, default 0.5 |
+| **Chinese** | Native (5000h Mandarin) | General (6000+ langs, no zh-specific) | Language-agnostic | Trained on AISHELL + AliMeeting | General |
+| **Soft fillers** | Good, but 200ms chunks may blur boundaries | Catches at low threshold (~0.1–0.3) | May miss quiet fillers | Best accuracy on soft boundaries | Less documented |
+| **Mac perf** | CPU, fast for 0.4M | ~40μs/chunk on M2 Max | Very fast CPU | Slow on CPU (68M params) | arm64 native + ONNX |
+| **MPS/GPU** | Yes (via FunASR) | MLX native | N/A (CPU only) | MPS supported | ONNX only |
+| **License** | Model-specific (check HF) | MIT | BSD | MIT | Apache 2.0 |
+
+## Analysis
+
+### FSMN-VAD
+- **Pro**: Already in our stack (FunASR). Chinese-native, production-proven.
+- **Con**: 200ms chunk size is coarser than ideal, so the boundaries of short
+  fillers (150–400ms) may be imprecise.
+
+### Silero VAD
+- **Pro**: Tiny, fast, MIT license. Per-frame probabilities (32ms frames at 16kHz)
+  let the threshold drop to ~0.1 for soft speech. MLX native on Apple Silicon.
+- **Con**: Not Chinese-optimized. Lightweight design means less sophisticated
+  boundary detection.
+
+### WebRTC VAD
+- **Pro**: Mature, fast, 10ms native resolution.
+- **Con**: No probability output — binary decisions with coarse aggressiveness
+  levels (0–3). Hard to tune for soft fillers. No fine-grained threshold.
+
+### pyannote.audio (segmentation-3.0)
+- **Pro**: Best accuracy. 10ms resolution. Trained on Chinese datasets
+  (AISHELL, AliMeeting). Best at catching soft speech boundaries.
+- **Con**: 68M params — slow on CPU. Overkill if we only need binary VAD.
+
+### TEN VAD
+- **Pro**: 10ms resolution; its authors report higher precision than WebRTC and Silero VAD. Apache 2.0.
+- **Con**: Newer (2024–2025), fewer production deployments. Less documented
+  for Chinese/soft speech.
+
+## Recommendation
+
+**Silero VAD** as primary choice:
+- 32ms frames (512 samples @ 16kHz) still resolve the paper's 150ms–2s candidates
+- Threshold tunable to 0.1 (critical finding from the paper)
+- Tiny and fast on MacBook
+- MIT license
+- Good enough for candidate generation — the classifier stage handles precision
+
+**FSMN-VAD** as comparison baseline since it's already in our pipeline.
+
+If accuracy on soft boundaries proves insufficient, upgrade to **pyannote.audio**.
+
+## References
+
+- [PodcastFillers paper](../refs/2203.pdf) — Section 2.2: VAD threshold 0.1, candidates 150ms–2s
+- [Silero VAD](https://github.com/snakers4/silero-vad)
+- [FSMN-VAD](https://huggingface.co/funasr/fsmn-vad)
+- [pyannote.audio](https://github.com/pyannote/pyannote-audio)
+- [TEN VAD](https://github.com/TEN-framework/ten-vad)
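The candidate rule cited in `01-vad-comparison.md` (threshold 0.1, candidates 150ms–2s) can be sketched roughly as below. This is a minimal illustration, not the notebook's implementation: the function name `candidate_segments` and its parameters are hypothetical, and the 32 ms default frame length is taken from the frame size documented in `02-data-structures.md`.

```python
import numpy as np

def candidate_segments(probs, threshold=0.1, frame_ms=32, min_ms=150, max_ms=2000):
    """Return (start_ms, end_ms) runs of frames with probs > threshold
    whose duration lies within [min_ms, max_ms]."""
    active = np.asarray(probs) > threshold
    # Pad with False on both sides so every active run has a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]  # ends are exclusive frame indices
    segments = []
    for s, e in zip(starts, ends):
        duration_ms = int(e - s) * frame_ms
        if min_ms <= duration_ms <= max_ms:
            segments.append((int(s) * frame_ms, int(e) * frame_ms))
    return segments
```

With 32 ms frames, the 150 ms lower bound corresponds to runs of at least five consecutive active frames, so a single-frame blip is discarded while a 160 ms run is kept.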
diff --git a/training/smart-turn-zh/docs/02-data-structures.md b/training/smart-turn-zh/docs/02-data-structures.md
@@ -0,0 +1,114 @@
+# Data Structures
+
+Artifacts produced by the notebook pipeline (`01-asr-transcribe`, `02-vad`) and their schemas.
+
+## Directory Layout
+
+```
+data/
+├── wav/                          # Source audio
+│   ├── R8001_M8004_MS801.wav     # 8-ch, 16 kHz, PCM-16
+│   └── R8003_M8001_MS801.wav
+├── asr_results/                  # Per-file ASR transcriptions
+│   ├── R8001_M8004_MS801.json
+│   └── R8003_M8001_MS801.json
+└── vad_probs/                    # Per-frame speech probabilities
+    ├── R8001_M8004_MS801.npy
+    └── R8003_M8001_MS801.npy
+```
+
+## Source Audio (`data/wav/*.wav`)
+
+| Property    | Value                          |
+|-------------|--------------------------------|
+| Format      | RIFF WAVE, PCM 16-bit          |
+| Channels    | 8 (far-field microphone array) |
+| Sample rate | 16 kHz                         |
+| Source      | AliMeeting (SLR-119) meetings  |
+
+## ASR Results (`data/asr_results/*.json`)
+
+**Format**: JSON — one file per WAV, named `{wav_stem}.json`.
+**Producer**: `notebooks/01-asr-transcribe.ipynb` (Paraformer-zh + FSMN-VAD + ct-punc).
+
+### File Schema
+
+Each file contains a JSON array of record objects:
+
+```jsonc
+[
+  {
+    "text": "全文转写结果...",            // full transcription (punctuated)
+    "sentences": [ /* see below */ ],
+    "timestamp": [ /* see below */ ]
+  }
+]
+```
+
+### `sentences` Array Element
+
+Each element is one sentence/chunk segmented by the ASR model.
+
+```jsonc
+{
+  "text":      "啊，",         // punctuated text
+  "raw_text":  "啊",           // text without punctuation
+  "start":     7130,           // start time (ms)
+  "end":       7370,           // end time (ms)
+  "timestamp": [[7130, 7370]]  // per-word [start_ms, end_ms] pairs
+}
+```
+
+| Field       | Type                | Unit | Description                                 |
+|-------------|---------------------|------|---------------------------------------------|
+| `text`      | string              | —    | Sentence with restored punctuation          |
+| `raw_text`  | string              | —    | Same sentence, no punctuation               |
+| `start`     | int                 | ms   | Sentence start time                         |
+| `end`       | int                 | ms   | Sentence end time                           |
+| `timestamp` | array of [int, int] | ms   | Per-word start/end pairs (10 ms resolution) |
+
+### Top-level `timestamp` Array
+
+Flat array of all per-word `[start_ms, end_ms]` pairs across the entire file (same data as the union of per-sentence timestamps).
+
+## VAD Probabilities (`data/vad_probs/*.npy`)
+
+**Format**: NumPy `.npy`, 1-D `float32` array.
+**Producer**: `notebooks/02-vad.ipynb` (Silero VAD).
+
+| Property          | Value                         |
+|-------------------|-------------------------------|
+| Shape             | `(num_frames,)`               |
+| Dtype             | `float32`                     |
+| Frame size        | 512 samples = **32 ms** @ 16 kHz |
+| Value range       | `[0.0, 1.0]` — P(speech)     |
+| Index → time      | `frame[i]` → `i * 32 ms`     |
+
+### Loading
+
+```python
+import numpy as np
+probs = np.load("data/vad_probs/R8001_M8004_MS801.npy")
+# probs[i] = speech probability at time i * 32 ms
+```
+
+### File Details
+
+| File                      | Frames  | Duration   |
+|---------------------------|---------|------------|
+| `R8001_M8004_MS801.npy`   | 49,183  | ~1,573.9 s |
+| `R8003_M8001_MS801.npy`   | 64,625  | ~2,068.0 s |
+
+## Cross-referencing ASR and VAD
+
+ASR timestamps are in **milliseconds**; VAD frames are **32 ms** each.
+
+```python
+# Convert ASR ms timestamp to VAD frame index
+vad_frame = asr_start_ms // 32
+
+# Convert VAD frame index to ms
+time_ms = vad_frame * 32
+```
+
+This alignment is used in the next pipeline step (filler candidate extraction), which selects regions where VAD is active (`prob > threshold`) but ASR produces no recognized words.
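The "VAD active, ASR silent" selection described in `02-data-structures.md` could look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's actual code: the function name `asr_silent_vad_active`, the `word_spans_ms` argument (intended to hold the top-level `timestamp` pairs from an ASR result file), and the 0.1 default threshold are all hypothetical choices.

```python
import numpy as np

FRAME_MS = 32  # VAD frame duration (512 samples @ 16 kHz), as documented above

def asr_silent_vad_active(probs, word_spans_ms, threshold=0.1):
    """Boolean mask over VAD frames: True where the VAD fires
    but no ASR word span covers the frame."""
    probs = np.asarray(probs)
    covered = np.zeros(len(probs), dtype=bool)
    for start_ms, end_ms in word_spans_ms:
        first = start_ms // FRAME_MS
        # Ceiling division: a frame only partly overlapped by a word still counts as covered.
        last = -(-end_ms // FRAME_MS)
        covered[first:last] = True
    return (probs > threshold) & ~covered
```

Rounding the start down and the end up is a deliberately conservative choice: a frame that a recognized word touches at all is never reported as a filler candidate. For the sample word span `[7130, 7370]`, this marks frames 222 through 230 as covered.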