test(parity): rust runner for silero VAD parity harness#5
Merged
Adds `tests/parity/Cargo.toml` and `src/main.rs` for the `silero-parity-runner` binary, which loads a 16 kHz mono WAV via `ffmpeg-next`, runs `silero::detect_speech` with the bundled ONNX model, and emits a JSON segment list. Pairs with the Python runner (next commit) for side-by-side comparison against upstream `silero-vad`.

The runner uses `ffmpeg-next` (not `hound`) for audio loading so the f32 buffer the model sees is byte-identical to what upstream Python silero-vad consumes via torchaudio/ffmpeg. That lets the parity score verify that both runners decoded the audio the same way before flagging any output divergence as a model issue. whispery's parity harness uses the same pattern.

Cargo.lock for the parity binary lives at `tests/parity/Cargo.lock` and is gitignored via the existing `Cargo.lock` rule. The parity crate is excluded from `cargo package` because subdirectories that contain their own `Cargo.toml` are skipped automatically.

test(parity): python runner + scorer for upstream silero-vad

Adds the upstream Python `silero-vad` reference side of the parity harness:

- `python/pyproject.toml`: requires `silero-vad >= 5.1` and `onnxruntime >= 1.18` (the `load_silero_vad(onnx=True)` path needs onnxruntime).
- `python/silero_vad_runner.py`: same CLI and same JSON schema as the Rust runner. Defaults to `--backend onnx` so both runners feed byte-identical input to ORT (same model file, same backend), and any IoU disagreement is segmenter logic, not the inference runtime. `--backend jit` is available for measuring runtime drift separately. Audio loading is an ffmpeg shell-out matching WhisperX's `load_audio` (`pcm_s16le -ac 1 -ar 16000` → `np.float32 / 32768.0`), the same byte path as the Rust loader.
- `python/score.py`: sequence-position pairing, per-segment IoU, median + p10/p90 + worst-N report, JSON summary on stdout / `--out`. Pass/fail: median IoU >= 0.95 AND segment counts match.
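The scoring rule described for `score.py` can be sketched roughly as follows. This is not the harness's actual code: the function names, the `(start, end)` tuple representation of a segment, and the treatment of two empty segment lists as a pass are all assumptions made for illustration, and the p10/p90 and worst-N reporting is left out.

```python
from statistics import median


def segment_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def score(rust_segments, python_segments, min_median_iou=0.95):
    """Pair segments by sequence position and apply the pass/fail rule."""
    # zip() pairs the i-th Rust segment with the i-th Python segment and
    # silently stops at the shorter list; the count check catches the rest.
    ious = [segment_iou(a, b) for a, b in zip(rust_segments, python_segments)]
    counts_match = len(rust_segments) == len(python_segments)
    # Assumption: two empty lists count as perfect agreement.
    median_iou = median(ious) if ious else (1.0 if counts_match else 0.0)
    return {
        "median_iou": median_iou,
        "counts_match": counts_match,
        "passed": counts_match and median_iou >= min_median_iou,
    }


identical = score([(0.0, 1.0), (2.0, 3.0)], [(0.0, 1.0), (2.0, 3.0)])
drifted = score([(0.0, 1.0), (2.0, 3.0)], [(0.0, 1.0), (2.0, 3.5)])
print(identical["passed"], drifted["passed"])
```

Pairing by position keeps the scorer simple, but it means a single extra or missing segment shifts every later pair, which is why the count check is part of the pass condition rather than a separate warning.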
test(parity): driver script and README

`run.sh` brings up the uv venv, runs both runners, and pipes the JSONs through `score.py`. It accepts a fixture directory (uses `clip_16k.wav` inside) or a direct WAV path.

`run.sh` passes `--min-silence-ms 132` to the Rust side as a parity override (NOT a crate-default change). The silero crate's `SpeechSegmenter::push_probability` computes `silence_samples = current_sample - silence_start` AFTER the current frame's increment, while upstream Python silero-vad computes it BEFORE: a one-frame (32 ms at 16 kHz / 512-sample windows) off-by-one that causes the crate to close segments after 4 low-prob frames vs Python's 5. Bumping the override by exactly one frame on the crate side restores identical segment counts (verified on all 5 dia parity fixtures: median IoU 1.0000 across the board).

README documents the layout, prerequisites (cargo + uv + ffmpeg, no ORT_DYLIB_PATH needed because `ort 2.0.0-rc.12`'s default features include `download-binaries` + `copy-dylibs`), the canonical fixture set (dia's parity fixtures, intentionally not copied), the parameter alignment table (named defaults match upstream silero-vad 6.2.1 exactly), and the off-by-one silence threshold finding in detail.

fix(detector): silence threshold counter matches upstream silero-vad

The silence counter in `SpeechSegmenter::push_probability` evaluated `silence_samples = current_sample - silence_start` AFTER the current frame's contribution had been added to `current_sample`, while upstream Python `silero-vad` evaluates the equivalent `sil_dur_now = cur_sample - temp_end` BEFORE the current frame is consumed. The crate's counter therefore fired one model frame (32 ms at 16 kHz / 512-sample windows) early: at the default `min_silence_duration_ms = 100`, the crate closed a segment after 4 consecutive low-probability frames, where Python tolerates the dip and closes after 5.
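The 4-versus-5 frame difference follows from simple arithmetic, which the sketch below reproduces. It is a simplified model, not the crate's or upstream's code: `frames_until_close` and its parameter are hypothetical names, the silence run is assumed to start at sample 0, and the close condition is assumed to be "duration >= threshold".

```python
SAMPLE_RATE = 16_000
FRAME_SAMPLES = 512  # one model window, 32 ms at 16 kHz


def frames_until_close(min_silence_ms, count_current_frame):
    """Number of consecutive low-probability frames needed to close a segment.

    count_current_frame=True models the v0.2.x ordering (silence measured
    after the current frame was added to the running sample count).
    count_current_frame=False models the upstream ordering (silence
    measured from the start of the current frame).
    """
    min_silence_samples = SAMPLE_RATE * min_silence_ms // 1000
    silence_start = 0  # start sample of the first low-probability frame
    frame_index = 0
    while True:
        frame_start = frame_index * FRAME_SAMPLES
        frame_end = frame_start + FRAME_SAMPLES
        now = frame_end if count_current_frame else frame_start
        if now - silence_start >= min_silence_samples:
            return frame_index + 1
        frame_index += 1


old_default = frames_until_close(100, count_current_frame=True)
upstream_default = frames_until_close(100, count_current_frame=False)
old_with_override = frames_until_close(132, count_current_frame=True)
print(old_default, upstream_default, old_with_override)  # 4 5 5
```

With a 100 ms threshold (1600 samples), the old ordering first reaches 2048 samples on the fourth frame, while the upstream ordering only reaches it on the fifth. Raising the old ordering's threshold to 132 ms (2112 samples) pushes its close to the fifth frame as well, which is why the one-frame override restored matching counts.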
Switch the comparator to `frame_start.saturating_sub(silence_start)`, mirroring Python's `cur_sample - temp_end` evaluated before the frame is consumed. The same correction applies to the `min_silence_at_max_speech_samples` comparator in the same block, which used the same off-by-one counter.

Audit existing tests:

- `middle_band_frames_do_not_reset_tentative_end` and `min_speech_duration_is_checked_before_padding` extended their trailing silence runs by one frame so the close still fires via `push_probability` under the corrected counter.
- `force_split_during_silence_closes_without_restarting` raised its `max_speech_duration` ceiling by one frame so the max-speech split still fires after `max_split_end` has been recorded by the silence-counter logic.
- All updated tests document the change in their docstrings, citing the parity harness in `tests/parity/` as motivation.

Add two new regression tests:

- `four_frame_silence_dip_does_not_close_segment_at_default_min_silence` pins that a 4-frame (128 ms) silence dip is now tolerated.
- `five_frame_silence_dip_closes_segment_at_default_min_silence` pins that 5 consecutive low-prob frames still close, matching upstream.

Discovered by the parity harness in commits dd64c35 / 003e8b6 / da8c0de.

test(parity): drop now-unneeded min-silence override

The `--min-silence-ms 132` workaround in `run.sh` was compensating for an off-by-one in `SpeechSegmenter::push_probability`'s silence counter (silero v0.2.x evaluated `silence_samples = current_sample - silence_start` after the current frame's increment, while Python evaluates the equivalent `cur_sample - temp_end` before the current frame is consumed). The crate fix in the previous commit aligns the two semantics, so the runner now uses upstream silero-vad defaults verbatim and parity numbers are unchanged.
Verified on all 5 short fixtures (01_dialogue, 02_pyannote_sample, 03_dual_speaker, 04_three_speaker, 05_four_speaker): median IoU 1.0000 and segment counts match exactly (51/51, 4/4, 14/14, 6/6, 14/14) without the override, the same numbers the override produced pre-fix.

The `--min-silence-ms` flag remains on the runner CLI for advanced users who want to override per-run; only `run.sh` no longer applies it. README updated to mark the off-by-one finding as fixed in v0.3.0 and preserve the previous analysis as historical context.

chore: bump to 0.3.0 with CHANGELOG entry for the silence threshold fix

The silence-counter fix in `SpeechSegmenter::push_probability` is a behaviour change for any caller that hand-tuned `min_silence_duration_ms` against v0.2.x's response curve. Bumping the minor version (0.2.x → 0.3.0) signals that even though it's strictly a bug fix, the new response curve may require re-tuning at the call site. Default callers do not need to change anything.

CHANGELOG entry covers what changed (silence-counter semantics now match upstream Python silero-vad), why (parity harness uncovered an off-by-one), and the migration note (subtract ~32 ms from hand-tuned `min_silence_duration_ms` overrides if you want to keep the v0.2.x effective behaviour).
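The migration note can be sanity-checked with a small closed-form model. This is only a sketch under assumptions: the two helper functions are hypothetical, they assume 512-sample frames at 16 kHz and a "duration >= threshold" close condition, and they ignore every other segmenter parameter.

```python
import math

SAMPLE_RATE = 16_000
FRAME_SAMPLES = 512
FRAME_MS = FRAME_SAMPLES * 1000 // SAMPLE_RATE  # 32 ms per model frame


def close_frames_v02(min_silence_ms):
    """v0.2.x model: silence counted after the current frame is added."""
    samples = SAMPLE_RATE * min_silence_ms // 1000
    return max(1, math.ceil(samples / FRAME_SAMPLES))


def close_frames_v03(min_silence_ms):
    """v0.3.0 / upstream model: silence counted from the frame start."""
    samples = SAMPLE_RATE * max(min_silence_ms, 0) // 1000
    return math.ceil(samples / FRAME_SAMPLES) + 1


for tuned_ms in (100, 150, 250, 500):
    # Keeping the old value unchanged costs one extra frame of silence...
    assert close_frames_v03(tuned_ms) == close_frames_v02(tuned_ms) + 1
    # ...while subtracting one frame (32 ms) restores the old behaviour.
    assert close_frames_v03(tuned_ms - FRAME_MS) == close_frames_v02(tuned_ms)

print(close_frames_v02(100), close_frames_v03(100), close_frames_v03(100 - FRAME_MS))
```

In this model the correction is exactly one frame, so a caller who tuned against v0.2.x and wants the same effective close point would reduce the override by 32 ms; callers on the default see the upstream behaviour without any change.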
Pull request overview
Adds a manual parity harness (Rust + Python) to compare silero’s VAD segmentation against upstream silero-vad, and aligns the Rust segmenter’s silence counter semantics with upstream Python (shipping as v0.3.0).
Changes:
- Introduce `tests/parity/` runner tooling: Rust `silero-parity-runner`, Python reference runner + IoU scorer, plus a driver script and README.
- Fix an off-by-one in `SpeechSegmenter::push_probability` silence accounting and update/add regression tests.
- Bump crate version to `0.3.0` and document the behavior change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/parity/src/main.rs | New Rust parity runner: ffmpeg decode → detect_speech → JSON output |
| tests/parity/Cargo.toml | New standalone Cargo package for the parity runner |
| tests/parity/run.sh | Driver script to run Rust + Python runners and score results |
| tests/parity/python/pyproject.toml | Python env definition for upstream silero-vad reference runner |
| tests/parity/python/silero_vad_runner.py | Python reference runner emitting the same JSON schema as Rust |
| tests/parity/python/score.py | IoU scoring and pass/fail logic for runner outputs |
| tests/parity/README.md | Harness documentation, parameter alignment, and historical off-by-one notes |
| src/detector.rs | Silence-counter semantic fix + updated and new regression tests |
| Cargo.toml | Version bump to 0.3.0 |
| CHANGELOG.md | Release notes for the silence-counter behavior change |
| .gitignore | Ignore parity harness outputs and Python venv artifacts |
feat(lib): expose `silero::VERSION` as a public constant
Out-of-tree harnesses (the parity runner being the immediate caller)
need a way to record the silero crate version they're exercising. Using
`env!("CARGO_PKG_VERSION")` in those harnesses resolves to the harness
binary's own version, not silero's. Re-exporting the version string from
the library lets callers depend on `silero::VERSION` and get the value
that actually matches the running detector logic.
Surfaced by Copilot PR review of #5
(#5 (comment)).
fix(parity): record silero crate version + persist ffmpeg init outcome

Two issues from Copilot's PR review of #5:

1. `SILERO_CRATE_VERSION` was sourced from `env!("CARGO_PKG_VERSION")`, which in this binary resolves to the parity runner's own version (`0.0.0`), not the silero crate version under test. The JSON output was misreporting `silero_crate_version`. Now uses the newly-exposed `silero::VERSION` constant so the JSON records the actual crate version being validated. (#5 (comment))
2. `ffmpeg_init` stored its initialisation error in a stack-local that was only assigned inside the `Once::call_once` closure. After the first failed call, subsequent invocations would silently return `Ok(())` because the closure no longer ran and the local was always `None`. Switched to a static `OnceLock<Result<(), String>>` so the init outcome is captured once and re-surfaced on every subsequent call; the function now actually behaves idempotently. (#5 (comment))

Verified: `cargo build --release` clean; parity smoke test on `02_pyannote_sample` reports `silero_crate_version=0.3.0`, median IoU 1.0000.
fix(parity): record effective max_speech_duration_s in Python output

The runner explicitly sets `max_speech_duration_s = math.inf` when `--max-speech-s` is omitted (so the call site records the value actually passed to `get_speech_timestamps`), but the emitted JSON was reading `args.max_speech_s` (i.e., `null`), which contradicted the inline comment and made the output non-self-describing.

Switched to writing `kwargs["max_speech_duration_s"]` so the JSON records the effective value (the float when the user provided one, otherwise `math.inf`). Python's `json.dumps` emits the latter as `Infinity`; documented inline that downstream parsers may need non-strict JSON support if they read this field from outside the Python ecosystem.

Surfaced by Copilot PR review of #5 (#5 (comment)).

Verified: parity smoke test on `02_pyannote_sample` shows `"max_speech_s": Infinity` in the Python JSON; median IoU still 1.0000.
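The `Infinity` behaviour mentioned above is standard-library behaviour and easy to reproduce. In the sketch below only the `max_speech_s` key comes from the commit message; the one-key payload and the variable names are made up for illustration.

```python
import json
import math

# Hypothetical one-key payload standing in for the runner's JSON output.
payload = {"max_speech_s": math.inf}

# By default (allow_nan=True) json.dumps writes the token Infinity,
# which is not part of the JSON grammar in RFC 8259.
text = json.dumps(payload)
print(text)

# Python's own parser accepts that token, so the value round-trips here.
round_tripped = json.loads(text)["max_speech_s"]

# With allow_nan=False the encoder raises instead of emitting the token,
# which is how a strictly compliant consumer would see the problem.
try:
    json.dumps(payload, allow_nan=False)
    strict_encoding_ok = True
except ValueError:
    strict_encoding_ok = False
print(round_tripped, strict_encoding_ok)
```

A consumer that uses a strict JSON parser would need either a lenient mode or a convention for this field, such as treating a missing value as unbounded; which option fits is a decision for whoever reads the file.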
uqio pushed a commit that referenced this pull request on May 3, 2026
* test(parity): rust runner for silero VAD parity harness
Adds `tests/parity/Cargo.toml` and `src/main.rs` for the
`silero-parity-runner` binary that loads a 16 kHz mono WAV via
`ffmpeg-next`, runs `silero::detect_speech` with the bundled ONNX
model, and emits a JSON segment list. Pairs with the Python runner
(next commit) for side-by-side comparison against upstream
`silero-vad`.
The runner uses `ffmpeg-next` (not `hound`) for audio loading so the
f32 buffer the model sees is byte-identical to what upstream Python
silero-vad consumes via torchaudio/ffmpeg, letting the parity score
verify both runners decoded the audio the same way before flagging
any output divergence as a model issue. Same pattern whispery's
parity harness uses.
Cargo.lock for the parity binary lives at `tests/parity/Cargo.lock`
and is gitignored via the existing `Cargo.lock` rule. The parity
crate is excluded from `cargo package` because nested `Cargo.toml`
workspaces are skipped automatically.
test(parity): python runner + scorer for upstream silero-vad
Adds the upstream Python `silero-vad` reference side of the parity
harness:
- `python/pyproject.toml`: pins `silero-vad >= 5.1` and `onnxruntime
>= 1.18` (the `load_silero_vad(onnx=True)` path needs onnxruntime).
- `python/silero_vad_runner.py`: same CLI / same JSON schema as the
Rust runner. Defaults to `--backend onnx` so both runners feed
byte-identical bytes to ORT — same model file, same backend — and
any IoU disagreement is segmenter logic, not the inference runtime.
`--backend jit` available for measuring runtime drift separately.
Audio loading is an ffmpeg shell-out matching WhisperX's
`load_audio` (`pcm_s16le -ac 1 -ar 16000` → `np.float32 / 32768.0`),
same byte path as the Rust loader.
- `python/score.py`: sequence-position pairing, per-segment IoU,
median + p10/p90 + worst-N report, JSON summary on stdout / `--out`.
Pass/fail: median IoU >= 0.95 AND segment counts match.
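The audio path named in the list above (`pcm_s16le -ac 1 -ar 16000`, then division by 32768) can be sketched as below. This is an approximation rather than the runner's code: the function names are hypothetical, and it uses the standard library's `struct` and plain Python floats instead of NumPy so that it runs without extra dependencies. Every int16 value divided by 32768 is exactly representable in float32, so the resulting values are the same ones a `np.float32` array would hold.

```python
import struct
import subprocess


def decode_pcm_s16le(raw):
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    sample_count = len(raw) // 2
    samples = struct.unpack(f"<{sample_count}h", raw[: sample_count * 2])
    return [sample / 32768.0 for sample in samples]


def load_audio(path, sample_rate=16_000):
    """Decode any ffmpeg-readable file to mono PCM at the given rate.

    Requires an ffmpeg binary on PATH; raises CalledProcessError on failure.
    """
    command = [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", "1", "-ar", str(sample_rate),
        "-",  # write raw samples to stdout
    ]
    raw = subprocess.run(command, capture_output=True, check=True).stdout
    return decode_pcm_s16le(raw)


# Three hand-built samples: silence, half scale, and the negative extreme.
example = decode_pcm_s16le(struct.pack("<3h", 0, 16384, -32768))
print(example)  # [0.0, 0.5, -1.0]
```

Keeping the byte-to-float step separate from the ffmpeg call makes it possible to check the normalisation on hand-built bytes, independent of whichever decoder produced them.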
test(parity): driver script and README
`run.sh` brings up the uv venv, runs both runners, and pipes the JSONs
through `score.py`. Accepts a fixture directory (uses
`clip_16k.wav` inside) or a direct WAV path.
`run.sh` passes `--min-silence-ms 132` to the Rust side as a parity
override (NOT a crate-default change). The silero crate's
`SpeechSegmenter::push_probability` computes
`silence_samples = current_sample - silence_start` AFTER the current
frame's increment, while upstream Python silero-vad computes it
BEFORE — a one-frame (32 ms at 16 kHz / 512-sample windows) off-by-one
that causes the crate to close segments after 4 low-prob frames vs
Python's 5. Bumping the override by exactly one frame on the crate
side restores byte-identical segment counts (verified on all 5 dia
parity fixtures: median IoU 1.0000 across the board).
README documents the layout, prerequisites (cargo + uv + ffmpeg, no
ORT_DYLIB_PATH needed because `ort 2.0.0-rc.12`'s default features
include `download-binaries` + `copy-dylibs`), the canonical fixture
set (dia's parity fixtures, intentionally not copied), the parameter
alignment table (named defaults match upstream silero-vad 6.2.1
exactly), and the off-by-one silence threshold finding in detail.
fix(detector): silence threshold counter matches upstream silero-vad
The silence-counter in `SpeechSegmenter::push_probability` evaluated
`silence_samples = current_sample - silence_start` AFTER the current
frame's contribution had been added to `current_sample`, while upstream
Python `silero-vad` evaluates the equivalent `sil_dur_now =
cur_sample - temp_end` BEFORE the current frame is consumed. The
crate's counter therefore fired one model frame (32 ms at 16 kHz /
512-sample windows) early — at the default
`min_silence_duration_ms = 100`, the crate closed a segment after 4
consecutive low-probability frames where Python tolerates the dip and
closes after 5.
Switch the comparator to `frame_start.saturating_sub(silence_start)`,
mirroring Python's `cur_sample - temp_end` evaluated before the frame
is consumed. Same correction applies to the
`min_silence_at_max_speech_samples` comparator on the same line block,
which used the same off-by-one counter.
Audit existing tests:
- `middle_band_frames_do_not_reset_tentative_end` and
`min_speech_duration_is_checked_before_padding` extended their
trailing silence runs by one frame so the close still fires via
`push_probability` under the corrected counter.
- `force_split_during_silence_closes_without_restarting` raised its
`max_speech_duration` ceiling by one frame so the max-speech split
still fires after `max_split_end` has been recorded by the
silence-counter logic.
- All updated tests document the change in their docstrings, citing
the parity harness in `tests/parity/` as motivation.
Add two new regression tests:
- `four_frame_silence_dip_does_not_close_segment_at_default_min_silence`
pins that a 4-frame (128 ms) silence dip is now tolerated.
- `five_frame_silence_dip_closes_segment_at_default_min_silence` pins
that 5 consecutive low-prob frames still close, matching upstream.
Discovered by the parity harness in commits dd64c35 / 003e8b6 /
da8c0de.
test(parity): drop now-unneeded min-silence override
The `--min-silence-ms 132` workaround in `run.sh` was compensating
for an off-by-one in `SpeechSegmenter::push_probability`'s silence
counter (silero v0.2.x evaluated `silence_samples = current_sample -
silence_start` after the current frame's increment, while Python
evaluates the equivalent `cur_sample - temp_end` before the current
frame is consumed). The crate fix in the previous commit aligns the
two semantics, so the runner now uses upstream silero-vad defaults
verbatim and parity numbers are unchanged.
Verified on all 5 short fixtures (01_dialogue, 02_pyannote_sample,
03_dual_speaker, 04_three_speaker, 05_four_speaker): median IoU
1.0000 and segment counts match exactly (51/51, 4/4, 14/14, 6/6,
14/14) without the override — same numbers the override produced
pre-fix.
The `--min-silence-ms` flag remains on the runner CLI for advanced
users who want to override per-run; only `run.sh` no longer applies
it. README updated to mark the off-by-one finding as fixed in v0.3.0
and preserve the previous analysis as historical context.
chore: bump to 0.3.0 with CHANGELOG entry for the silence threshold fix
The silence-counter fix in `SpeechSegmenter::push_probability` is a
behaviour change for any caller that hand-tuned
`min_silence_duration_ms` against v0.2.x's response curve. Bumping the
minor version (0.2.x → 0.3.0) signals that even though it's strictly
a bug fix, the new response curve may require re-tuning at the call
site. Default callers do not need to change anything.
CHANGELOG entry covers what changed (silence-counter semantics now
match upstream Python silero-vad), why (parity harness uncovered an
off-by-one), and the migration note (subtract ~32 ms from
hand-tuned `min_silence_duration_ms` overrides if you want to keep
the v0.2.x effective behaviour).
* feat(lib): expose `silero::VERSION` as a public constant
Out-of-tree harnesses (the parity runner being the immediate caller)
need a way to record the silero crate version they're exercising. Using
`env!("CARGO_PKG_VERSION")` in those harnesses resolves to the harness
binary's own version, not silero's. Re-exporting the version string from
the library lets callers depend on `silero::VERSION` and get the value
that actually matches the running detector logic.
Surfaced by Copilot PR review of #5
(#5 (comment)).
* fix(parity): record silero crate version + persist ffmpeg init outcome
Two issues from Copilot's PR review of #5:
1. `SILERO_CRATE_VERSION` was sourced from `env!("CARGO_PKG_VERSION")`,
which in this binary resolves to the parity runner's own version
(`0.0.0`) — not the silero crate version under test. The JSON output
was misreporting `silero_crate_version`. Now uses the newly-exposed
`silero::VERSION` constant so the JSON records the actual crate
version being validated.
(#5 (comment))
2. `ffmpeg_init` stored its initialisation error in a stack-local that
was only assigned inside the `Once::call_once` closure. After the
first failed call, subsequent invocations would silently return
`Ok(())` because the closure no longer ran and the local was always
`None`. Switched to a static `OnceLock<Result<(), String>>` so the
init outcome is captured once and re-surfaced on every subsequent
call — the function now actually behaves idempotently.
(#5 (comment))
Verified: `cargo build --release` clean; parity smoke test on
`02_pyannote_sample` reports `silero_crate_version=0.3.0`, median IoU
1.0000.
* fix(parity): record effective max_speech_duration_s in Python output
The runner explicitly sets `max_speech_duration_s = math.inf` when
`--max-speech-s` is omitted (so the call site records the value actually
passed to `get_speech_timestamps`), but the emitted JSON was reading
`args.max_speech_s` — i.e., `null` — which contradicted the inline
comment and made the output non-self-describing.
Switched to writing `kwargs["max_speech_duration_s"]` so the JSON
records the effective value (the float when the user provided one,
otherwise `math.inf`). Python's `json.dumps` emits the latter as
`Infinity`; documented inline that downstream parsers may need
non-strict JSON support if they read this field from outside the
Python ecosystem.
Surfaced by Copilot PR review of #5
(#5 (comment)).
Verified: parity smoke test on `02_pyannote_sample` shows
`"max_speech_s": Infinity` in the Python JSON; median IoU still 1.0000.
---------
uqio added a commit that referenced this pull request on May 3, 2026
Folds the post-review additions into the unreleased 0.3.0 entry:
- `### Added`: `silero::VERSION` public constant.
- `### Added`: `tests/parity/` harness (was previously only described under Verified, not Added).
- `### Fixed`: three parity-harness bugs surfaced by the Copilot PR review on #5: ffmpeg-init error swallowing, parity runner reporting its own version instead of silero's, and Python runner emitting `null` instead of the effective `max_speech_duration_s`.
uqio
added a commit
that referenced
this pull request
May 3, 2026
Folds the post-review additions into the unreleased 0.3.0 entry: - ### Added: `silero::VERSION` public constant. - ### Added: `tests/parity/` harness (was previously only described under Verified, not Added). - ### Fixed: three parity-harness bugs surfaced by the Copilot PR review on #5 — ffmpeg-init error swallowing, parity runner reporting its own version instead of silero's, Python runner emitting `null` instead of the effective `max_speech_duration_s`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.