test(parity): rust runner for silero VAD parity harness #5

Merged: uqio merged 4 commits into main from test/parity on May 3, 2026


@al8n al8n commented May 3, 2026

Adds `tests/parity/Cargo.toml` and `src/main.rs` for the
`silero-parity-runner` binary that loads a 16 kHz mono WAV via
`ffmpeg-next`, runs `silero::detect_speech` with the bundled ONNX
model, and emits a JSON segment list. Pairs with the Python runner
(next commit) for side-by-side comparison against upstream
`silero-vad`.

The runner uses `ffmpeg-next` (not `hound`) for audio loading so the
f32 buffer the model sees is byte-identical to what upstream Python
silero-vad consumes via torchaudio/ffmpeg, letting the parity score
verify both runners decoded the audio the same way before flagging
any output divergence as a model issue. Same pattern whispery's
parity harness uses.
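
For illustration, the sample normalization that this byte-parity argument depends on can be sketched as follows. This is not the runner's actual code (which decodes through `ffmpeg-next`); `pcm_s16le_to_f32` is a hypothetical helper, and the divide-by-32768.0 scaling is the one the Python runner's `load_audio` path is described as using.

```rust
/// Convert little-endian signed 16-bit PCM bytes into f32 samples in [-1.0, 1.0).
/// Hypothetical helper for illustration; a trailing odd byte is ignored.
pub fn pcm_s16le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        // Same scaling as `np.float32 / 32768.0` on the Python side.
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}
```

If both runners apply this exact scaling to the same decoded s16 stream, the f32 buffers should agree bit for bit, which is what lets the harness attribute any remaining divergence to the segmenter.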

Cargo.lock for the parity binary lives at `tests/parity/Cargo.lock`
and is gitignored via the existing `Cargo.lock` rule. The parity
crate is excluded from `cargo package` because Cargo automatically
skips subdirectories that contain their own `Cargo.toml`.

test(parity): python runner + scorer for upstream silero-vad

Adds the upstream Python `silero-vad` reference side of the parity
harness:

- `python/pyproject.toml`: requires `silero-vad >= 5.1` and `onnxruntime
  >= 1.18` (the `load_silero_vad(onnx=True)` path needs onnxruntime).
- `python/silero_vad_runner.py`: same CLI / same JSON schema as the
  Rust runner. Defaults to `--backend onnx` so both runners feed
  byte-identical input to ORT (same model file, same backend), and
  any IoU disagreement is segmenter logic, not the inference runtime.
  `--backend jit` is available for measuring runtime drift separately.
  Audio loading is an ffmpeg shell-out matching WhisperX's
  `load_audio` (`pcm_s16le -ac 1 -ar 16000` → `np.float32 / 32768.0`),
  same byte path as the Rust loader.
- `python/score.py`: sequence-position pairing, per-segment IoU,
  median + p10/p90 + worst-N report, JSON summary on stdout / `--out`.
  Pass/fail: median IoU >= 0.95 AND segment counts match.
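
A minimal sketch of the scoring rule described for `score.py`, written in Rust to match the crate rather than copied from the Python script. The function names are hypothetical, and treating two empty segment lists as a pass is an assumption the commit message does not state.

```rust
/// Intersection-over-union of two `(start, end)` intervals in seconds.
pub fn iou(a: (f64, f64), b: (f64, f64)) -> f64 {
    let inter = (a.1.min(b.1) - a.0.max(b.0)).max(0.0);
    let union = (a.1 - a.0) + (b.1 - b.0) - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Median of a slice, or `None` when it is empty.
pub fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|x, y| x.total_cmp(y));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Pass when segment counts match AND the median per-segment IoU is >= 0.95.
/// Segments are paired by sequence position, as the commit message describes.
pub fn parity_passes(rust: &[(f64, f64)], python: &[(f64, f64)]) -> bool {
    if rust.len() != python.len() {
        return false;
    }
    let ious: Vec<f64> = rust.iter().zip(python).map(|(r, p)| iou(*r, *p)).collect();
    // Assumption: two empty outputs agree trivially.
    median(&ious).map_or(true, |m| m >= 0.95)
}
```

Pairing by position keeps the scorer simple, but it also means a single extra or missing segment shifts every later pair, which is presumably why the count check is part of the pass condition.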

test(parity): driver script and README

`run.sh` brings up the uv venv, runs both runners, and pipes the JSONs
through `score.py`. Accepts a fixture directory (uses
`clip_16k.wav` inside) or a direct WAV path.

`run.sh` passes `--min-silence-ms 132` to the Rust side as a parity
override (NOT a crate-default change). The silero crate's
`SpeechSegmenter::push_probability` computes
`silence_samples = current_sample - silence_start` AFTER the current
frame's increment, while upstream Python silero-vad computes it
BEFORE — a one-frame (32 ms at 16 kHz / 512-sample windows) off-by-one
that causes the crate to close segments after 4 low-prob frames vs
Python's 5. Raising the override by exactly one frame (100 ms + 32 ms)
on the crate side restores identical segment counts (verified on all
5 dia parity fixtures: median IoU 1.0000 across the board).

README documents the layout, prerequisites (cargo + uv + ffmpeg, no
ORT_DYLIB_PATH needed because `ort 2.0.0-rc.12`'s default features
include `download-binaries` + `copy-dylibs`), the canonical fixture
set (dia's parity fixtures, intentionally not copied), the parameter
alignment table (named defaults match upstream silero-vad 6.2.1
exactly), and the off-by-one silence threshold finding in detail.

fix(detector): silence threshold counter matches upstream silero-vad

The silence-counter in `SpeechSegmenter::push_probability` evaluated
`silence_samples = current_sample - silence_start` AFTER the current
frame's contribution had been added to `current_sample`, while upstream
Python `silero-vad` evaluates the equivalent `sil_dur_now =
cur_sample - temp_end` BEFORE the current frame is consumed. The
crate's counter therefore fired one model frame (32 ms at 16 kHz /
512-sample windows) early — at the default
`min_silence_duration_ms = 100`, the crate closed a segment after 4
consecutive low-probability frames where Python tolerates the dip and
closes after 5.

Switch the comparator to `frame_start.saturating_sub(silence_start)`,
mirroring Python's `cur_sample - temp_end` evaluated before the frame
is consumed. Same correction applies to the
`min_silence_at_max_speech_samples` comparator on the same line block,
which used the same off-by-one counter.
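
The difference between the two comparators can be reproduced with a small model. This is a sketch under stated assumptions, not the crate's `push_probability`: it assumes 512-sample frames at 16 kHz, that silence starts at the first low-probability frame's start, and that a segment closes once the counter is greater than or equal to the threshold. `Counter` and `frames_to_close` are hypothetical names.

```rust
const FRAME_SAMPLES: u64 = 512;
const SAMPLE_RATE_HZ: u64 = 16_000;

/// Which point in the frame loop the silence duration is evaluated at.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Counter {
    /// v0.2.x behaviour: evaluated after the frame was added to `current_sample`.
    AfterIncrement,
    /// Upstream behaviour: evaluated before the frame is consumed.
    BeforeIncrement,
}

/// Number of consecutive low-probability frames needed before a segment closes.
pub fn frames_to_close(min_silence_ms: u64, counter: Counter) -> u64 {
    let min_silence_samples = min_silence_ms * SAMPLE_RATE_HZ / 1000;
    let silence_start = 0u64; // assumed: start of the first low-probability frame
    let mut frames = 0u64;
    loop {
        let frame_start = frames * FRAME_SAMPLES;
        let current_sample = frame_start + FRAME_SAMPLES;
        frames += 1;
        let silence_samples = match counter {
            Counter::AfterIncrement => current_sample - silence_start,
            Counter::BeforeIncrement => frame_start.saturating_sub(silence_start),
        };
        if silence_samples >= min_silence_samples {
            return frames;
        }
    }
}
```

At the 100 ms default (1600 samples) the after-increment counter reaches 2048 samples on the fourth frame, while the before-increment counter only reaches 1536 there and needs a fifth. With the old counter and a 132 ms threshold (2112 samples), the fourth frame falls short too, which is why the earlier override restored parity.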

Audit existing tests:
- `middle_band_frames_do_not_reset_tentative_end` and
  `min_speech_duration_is_checked_before_padding` extended their
  trailing silence runs by one frame so the close still fires via
  `push_probability` under the corrected counter.
- `force_split_during_silence_closes_without_restarting` raised its
  `max_speech_duration` ceiling by one frame so the max-speech split
  still fires after `max_split_end` has been recorded by the
  silence-counter logic.
- All updated tests document the change in their docstrings, citing
  the parity harness in `tests/parity/` as motivation.

Add two new regression tests:
- `four_frame_silence_dip_does_not_close_segment_at_default_min_silence`
  pins that a 4-frame (128 ms) silence dip is now tolerated.
- `five_frame_silence_dip_closes_segment_at_default_min_silence` pins
  that 5 consecutive low-prob frames still close, matching upstream.

Discovered by the parity harness in commits dd64c35 / 003e8b6 /
da8c0de.

test(parity): drop now-unneeded min-silence override

The `--min-silence-ms 132` workaround in `run.sh` was compensating
for an off-by-one in `SpeechSegmenter::push_probability`'s silence
counter (silero v0.2.x evaluated `silence_samples = current_sample -
silence_start` after the current frame's increment, while Python
evaluates the equivalent `cur_sample - temp_end` before the current
frame is consumed). The crate fix in the previous commit aligns the
two semantics, so the runner now uses upstream silero-vad defaults
verbatim and parity numbers are unchanged.

Verified on all 5 short fixtures (01_dialogue, 02_pyannote_sample,
03_dual_speaker, 04_three_speaker, 05_four_speaker): median IoU
1.0000 and segment counts match exactly (51/51, 4/4, 14/14, 6/6,
14/14) without the override — same numbers the override produced
pre-fix.

The `--min-silence-ms` flag remains on the runner CLI for advanced
users who want to override per-run; only `run.sh` no longer applies
it. README updated to mark the off-by-one finding as fixed in v0.3.0
and preserve the previous analysis as historical context.

chore: bump to 0.3.0 with CHANGELOG entry for the silence threshold fix

The silence-counter fix in `SpeechSegmenter::push_probability` is a
behaviour change for any caller that hand-tuned
`min_silence_duration_ms` against v0.2.x's response curve. Bumping the
minor version (0.2.x → 0.3.0) signals that even though it's strictly
a bug fix, the new response curve may require re-tuning at the call
site. Default callers do not need to change anything.

CHANGELOG entry covers what changed (silence-counter semantics now
match upstream Python silero-vad), why (parity harness uncovered an
off-by-one), and the migration note (subtract ~32 ms from
hand-tuned `min_silence_duration_ms` overrides if you want to keep
the v0.2.x effective behaviour).
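
As a rough check of the migration note, the closed-form frame counts below show that subtracting one frame (32 ms) from a v0.2.x-tuned threshold gives the same close point under the v0.3.0 counter. All names are hypothetical, and the model assumes 512-sample frames at 16 kHz with a greater-or-equal comparison; it is not taken from the crate.

```rust
const SAMPLE_RATE_HZ: u64 = 16_000;
const FRAME_SAMPLES: u64 = 512;
const FRAME_MS: u64 = FRAME_SAMPLES * 1000 / SAMPLE_RATE_HZ; // 32 ms

/// Threshold expressed in whole frames, rounded up.
fn threshold_frames(min_silence_ms: u64) -> u64 {
    let samples = min_silence_ms * SAMPLE_RATE_HZ / 1000;
    (samples + FRAME_SAMPLES - 1) / FRAME_SAMPLES
}

/// Low-probability frames needed to close under the v0.2.x counter.
pub fn frames_to_close_v02x(min_silence_ms: u64) -> u64 {
    threshold_frames(min_silence_ms).max(1)
}

/// Low-probability frames needed to close under the v0.3.0 counter.
pub fn frames_to_close_v030(min_silence_ms: u64) -> u64 {
    threshold_frames(min_silence_ms) + 1
}

/// Threshold to pass to v0.3.0 to keep a v0.2.x-tuned close point.
pub fn migrate_min_silence_ms(v02x_ms: u64) -> u64 {
    v02x_ms.saturating_sub(FRAME_MS)
}
```

For example, a caller that tuned `min_silence_duration_ms = 100` against v0.2.x (close after 4 frames) would pass 68 to v0.3.0 to keep closing after 4 frames in this model.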

Copilot AI left a comment

Pull request overview

Adds a manual parity harness (Rust + Python) to compare silero’s VAD segmentation against upstream silero-vad, and aligns the Rust segmenter’s silence counter semantics with upstream Python (shipping as v0.3.0).

Changes:

  • Introduce tests/parity/ runner tooling: Rust silero-parity-runner, Python reference runner + IoU scorer, plus a driver script and README.
  • Fix an off-by-one in SpeechSegmenter::push_probability silence accounting and update/add regression tests.
  • Bump crate version to 0.3.0 and document the behavior change in CHANGELOG.md.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/parity/src/main.rs | New Rust parity runner: ffmpeg decode → detect_speech → JSON output |
| tests/parity/Cargo.toml | New standalone Cargo package for the parity runner |
| tests/parity/run.sh | Driver script to run Rust + Python runners and score results |
| tests/parity/python/pyproject.toml | Python env definition for upstream silero-vad reference runner |
| tests/parity/python/silero_vad_runner.py | Python reference runner emitting the same JSON schema as Rust |
| tests/parity/python/score.py | IoU scoring and pass/fail logic for runner outputs |
| tests/parity/README.md | Harness documentation, parameter alignment, and historical off-by-one notes |
| src/detector.rs | Silence-counter semantic fix + updated and new regression tests |
| Cargo.toml | Version bump to 0.3.0 |
| CHANGELOG.md | Release notes for the silence-counter behavior change |
| .gitignore | Ignore parity harness outputs and Python venv artifacts |

Comment thread tests/parity/src/main.rs Outdated
Comment thread tests/parity/src/main.rs
Comment thread tests/parity/python/silero_vad_runner.py Outdated
uqio added 3 commits May 3, 2026 19:07

feat(lib): expose `silero::VERSION` as a public constant

Out-of-tree harnesses (the parity runner being the immediate caller)
need a way to record the silero crate version they're exercising. Using
`env!("CARGO_PKG_VERSION")` in those harnesses resolves to the harness
binary's own version, not silero's. Re-exporting the version string from
the library lets callers depend on `silero::VERSION` and get the value
that actually matches the running detector logic.
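
A sketch of the re-export pattern, under assumptions: the commit describes `env!("CARGO_PKG_VERSION")`, while this version uses `option_env!` with a fallback so the snippet also compiles outside Cargo. `version_json_field` is a hypothetical helper and is not part of the crate.

```rust
/// Version of the crate that defines this constant. The macro is expanded
/// when *this* crate is compiled, so downstream binaries that read the
/// constant see the library's version rather than their own.
pub const VERSION: &str = match option_env!("CARGO_PKG_VERSION") {
    Some(v) => v,
    // Assumption for the sketch: fall back when built without Cargo.
    None => "unknown",
};

/// Hypothetical helper: render the version as a JSON object field.
pub fn version_json_field(lib_version: &str) -> String {
    format!("\"silero_crate_version\": \"{lib_version}\"")
}
```

A harness would pass the library's constant (for example `silero::VERSION`) to its report code instead of expanding the macro in its own crate.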

Surfaced by Copilot PR review of #5
(#5 (comment)).

fix(parity): record silero crate version + persist ffmpeg init outcome

Two issues from Copilot's PR review of #5:

1. `SILERO_CRATE_VERSION` was sourced from `env!("CARGO_PKG_VERSION")`,
   which in this binary resolves to the parity runner's own version
   (`0.0.0`) — not the silero crate version under test. The JSON output
   was misreporting `silero_crate_version`. Now uses the newly-exposed
   `silero::VERSION` constant so the JSON records the actual crate
   version being validated.
   (#5 (comment))

2. `ffmpeg_init` stored its initialisation error in a stack-local that
   was only assigned inside the `Once::call_once` closure. After the
   first failed call, subsequent invocations would silently return
   `Ok(())` because the closure no longer ran and the local was always
   `None`. Switched to a static `OnceLock<Result<(), String>>` so the
   init outcome is captured once and re-surfaced on every subsequent
   call — the function now actually behaves idempotently.
   (#5 (comment))
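
The second fix can be sketched with the standard library alone. This is an illustration of the `OnceLock<Result<(), String>>` pattern, not the runner's `ffmpeg_init`: `init_once`, `simulated_backend_init`, and `backend_init` are hypothetical names, and the failing initializer stands in for a real ffmpeg call.

```rust
use std::sync::OnceLock;

/// Run `init` at most once per cell and return the stored outcome on every call.
pub fn init_once<F>(cell: &OnceLock<Result<(), String>>, init: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    // `get_or_init` only runs the closure while the cell is still empty;
    // later calls just clone the recorded result, including a recorded error.
    cell.get_or_init(init).clone()
}

/// Stand-in for a fallible one-time initialisation (always fails here).
fn simulated_backend_init() -> Result<(), String> {
    Err("simulated init failure".to_string())
}

static BACKEND_INIT: OnceLock<Result<(), String>> = OnceLock::new();

/// Process-wide wrapper: the first outcome is captured and re-surfaced.
pub fn backend_init() -> Result<(), String> {
    init_once(&BACKEND_INIT, simulated_backend_init)
}
```

Because the outcome lives in the static rather than in a stack-local, a second call after a failed first call reports the same error instead of returning `Ok(())`.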

Verified: `cargo build --release` clean; parity smoke test on
`02_pyannote_sample` reports `silero_crate_version=0.3.0`, median IoU
1.0000.

fix(parity): record effective max_speech_duration_s in Python output

The runner explicitly sets `max_speech_duration_s = math.inf` when
`--max-speech-s` is omitted (so the call site records the value actually
passed to `get_speech_timestamps`), but the emitted JSON was reading
`args.max_speech_s` — i.e., `null` — which contradicted the inline
comment and made the output non-self-describing.

Switched to writing `kwargs["max_speech_duration_s"]` so the JSON
records the effective value (the float when the user provided one,
otherwise `math.inf`). Python's `json.dumps` emits the latter as
`Infinity`; documented inline that downstream parsers may need
non-strict JSON support if they read this field from outside the
Python ecosystem.

Surfaced by Copilot PR review of #5
(#5 (comment)).

Verified: parity smoke test on `02_pyannote_sample` shows
`"max_speech_s": Infinity` in the Python JSON; median IoU still 1.0000.
@uqio uqio merged commit af48dc3 into main May 3, 2026
8 of 11 checks passed
@uqio uqio deleted the test/parity branch May 3, 2026 07:20
uqio pushed a commit that referenced this pull request May 3, 2026
uqio added a commit that referenced this pull request May 3, 2026
* test(parity): rust runner for silero VAD parity harness

Adds `tests/parity/Cargo.toml` and `src/main.rs` for the
`silero-parity-runner` binary that loads a 16 kHz mono WAV via
`ffmpeg-next`, runs `silero::detect_speech` with the bundled ONNX
model, and emits a JSON segment list. Pairs with the Python runner
(next commit) for side-by-side comparison against upstream
`silero-vad`.

The runner uses `ffmpeg-next` (not `hound`) for audio loading so the
f32 buffer the model sees is byte-identical to what upstream Python
silero-vad consumes via torchaudio/ffmpeg, letting the parity score
verify both runners decoded the audio the same way before flagging
any output divergence as a model issue. Same pattern whispery's
parity harness uses.

Cargo.lock for the parity binary lives at `tests/parity/Cargo.lock`
and is gitignored via the existing `Cargo.lock` rule. The parity
crate is excluded from `cargo package` because nested `Cargo.toml`
workspaces are skipped automatically.

test(parity): python runner + scorer for upstream silero-vad

Adds the upstream Python `silero-vad` reference side of the parity
harness:

- `python/pyproject.toml`: pins `silero-vad >= 5.1` and `onnxruntime
  >= 1.18` (the `load_silero_vad(onnx=True)` path needs onnxruntime).
- `python/silero_vad_runner.py`: same CLI / same JSON schema as the
  Rust runner. Defaults to `--backend onnx` so both runners feed
  byte-identical bytes to ORT — same model file, same backend — and
  any IoU disagreement is segmenter logic, not the inference runtime.
  `--backend jit` available for measuring runtime drift separately.
  Audio loading is an ffmpeg shell-out matching WhisperX's
  `load_audio` (`pcm_s16le -ac 1 -ar 16000` → `np.float32 / 32768.0`),
  same byte path as the Rust loader.
- `python/score.py`: sequence-position pairing, per-segment IoU,
  median + p10/p90 + worst-N report, JSON summary on stdout / `--out`.
  Pass/fail: median IoU >= 0.95 AND segment counts match.

test(parity): driver script and README

`run.sh` brings up the uv venv, runs both runners, and pipes the JSONs
through `score.py`. Accepts a fixture directory (uses
`clip_16k.wav` inside) or a direct WAV path.

`run.sh` passes `--min-silence-ms 132` to the Rust side as a parity
override (NOT a crate-default change). The silero crate's
`SpeechSegmenter::push_probability` computes
`silence_samples = current_sample - silence_start` AFTER the current
frame's increment, while upstream Python silero-vad computes it
BEFORE — a one-frame (32 ms at 16 kHz / 512-sample windows) off-by-one
that causes the crate to close segments after 4 low-prob frames vs
Python's 5. Bumping the override by exactly one frame on the crate
side restores byte-identical segment counts (verified on all 5 dia
parity fixtures: median IoU 1.0000 across the board).

README documents the layout, prerequisites (cargo + uv + ffmpeg, no
ORT_DYLIB_PATH needed because `ort 2.0.0-rc.12`'s default features
include `download-binaries` + `copy-dylibs`), the canonical fixture
set (dia's parity fixtures, intentionally not copied), the parameter
alignment table (named defaults match upstream silero-vad 6.2.1
exactly), and the off-by-one silence threshold finding in detail.

fix(detector): silence threshold counter matches upstream silero-vad

The silence-counter in `SpeechSegmenter::push_probability` evaluated
`silence_samples = current_sample - silence_start` AFTER the current
frame's contribution had been added to `current_sample`, while upstream
Python `silero-vad` evaluates the equivalent `sil_dur_now =
cur_sample - temp_end` BEFORE the current frame is consumed. The
crate's counter therefore fired one model frame (32 ms at 16 kHz /
512-sample windows) early — at the default
`min_silence_duration_ms = 100`, the crate closed a segment after 4
consecutive low-probability frames where Python tolerates the dip and
closes after 5.

Switch the comparator to `frame_start.saturating_sub(silence_start)`,
mirroring Python's `cur_sample - temp_end` evaluated before the frame
is consumed. Same correction applies to the
`min_silence_at_max_speech_samples` comparator on the same line block,
which used the same off-by-one counter.
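
The frame arithmetic behind the 4-versus-5 difference can be reproduced with a small simulation. This is a simplified model, not the crate's or upstream's code: it assumes silence starts at the first low-probability frame's start sample, that a segment closes once the measured silence is at least `min_silence_samples`, and the names `frames_until_close` and `measure_before_frame` are made up for the example.

```python
SAMPLE_RATE = 16_000
FRAME_SAMPLES = 512  # one model frame, 32 ms at 16 kHz


def frames_until_close(min_silence_ms, measure_before_frame):
    """Count consecutive low-probability frames needed to close a segment.

    measure_before_frame=False models the old counter (silence measured
    after the current frame was added to current_sample);
    measure_before_frame=True models the corrected counter (silence
    measured from the frame's start sample).
    """
    min_silence_samples = SAMPLE_RATE * min_silence_ms // 1000
    silence_start = 0      # start sample of the first low-probability frame
    current_sample = 0
    frames = 0
    while True:
        frames += 1
        frame_start = current_sample
        current_sample += FRAME_SAMPLES
        reference = frame_start if measure_before_frame else current_sample
        if reference - silence_start >= min_silence_samples:
            return frames
```

With the 100 ms default this model gives 4 frames for the old counter and 5 for the corrected one, and the old counter with a 132 ms threshold also gives 5, which is consistent with the earlier `--min-silence-ms 132` workaround.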

Audit existing tests:
- `middle_band_frames_do_not_reset_tentative_end` and
  `min_speech_duration_is_checked_before_padding` extended their
  trailing silence runs by one frame so the close still fires via
  `push_probability` under the corrected counter.
- `force_split_during_silence_closes_without_restarting` raised its
  `max_speech_duration` ceiling by one frame so the max-speech split
  still fires after `max_split_end` has been recorded by the
  silence-counter logic.
- All updated tests document the change in their docstrings, citing
  the parity harness in `tests/parity/` as motivation.

Add two new regression tests:
- `four_frame_silence_dip_does_not_close_segment_at_default_min_silence`
  pins that a 4-frame (128 ms) silence dip is now tolerated.
- `five_frame_silence_dip_closes_segment_at_default_min_silence` pins
  that 5 consecutive low-prob frames still close, matching upstream.

Discovered by the parity harness in commits dd64c35 / 003e8b6 /
da8c0de.

test(parity): drop now-unneeded min-silence override

The `--min-silence-ms 132` workaround in `run.sh` was compensating
for an off-by-one in `SpeechSegmenter::push_probability`'s silence
counter (silero v0.2.x evaluated `silence_samples = current_sample -
silence_start` after the current frame's increment, while Python
evaluates the equivalent `cur_sample - temp_end` before the current
frame is consumed). The crate fix in the previous commit aligns the
two semantics, so the runner now uses upstream silero-vad defaults
verbatim and parity numbers are unchanged.

Verified on all 5 short fixtures (01_dialogue, 02_pyannote_sample,
03_dual_speaker, 04_three_speaker, 05_four_speaker): median IoU
1.0000 and segment counts match exactly (51/51, 4/4, 14/14, 6/6,
14/14) without the override — same numbers the override produced
pre-fix.

The `--min-silence-ms` flag remains on the runner CLI for advanced
users who want to override per-run; only `run.sh` no longer applies
it. README updated to mark the off-by-one finding as fixed in v0.3.0
and preserve the previous analysis as historical context.

chore: bump to 0.3.0 with CHANGELOG entry for the silence threshold fix

The silence-counter fix in `SpeechSegmenter::push_probability` is a
behaviour change for any caller that hand-tuned
`min_silence_duration_ms` against v0.2.x's response curve. Bumping the
minor version (0.2.x → 0.3.0) signals that even though it's strictly
a bug fix, the new response curve may require re-tuning at the call
site. Default callers do not need to change anything.

CHANGELOG entry covers what changed (silence-counter semantics now
match upstream Python silero-vad), why (parity harness uncovered an
off-by-one), and the migration note (subtract ~32 ms from
hand-tuned `min_silence_duration_ms` overrides if you want to keep
the v0.2.x effective behaviour).
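
The migration note can be checked with closed-form frame counts. The sketch below is a simplified model under stated assumptions (32 ms frames, a segment closing once measured silence reaches the threshold); the helper names are hypothetical and are not part of the crate's API.

```python
FRAME_MS = 32  # one 512-sample frame at 16 kHz


def ceil_div(n, d):
    """Integer ceiling division for non-negative n."""
    return -(-n // d)


def frames_to_close_v02(min_silence_ms):
    # Old counter: silence is measured after the current frame is counted.
    return max(1, ceil_div(min_silence_ms, FRAME_MS))


def frames_to_close_v03(min_silence_ms):
    # Corrected counter: silence is measured from the current frame's start.
    return ceil_div(min_silence_ms, FRAME_MS) + 1


def migrate_override(v02_min_silence_ms):
    """Value to pass under v0.3.0 to keep the v0.2.x effective behaviour."""
    return max(0, v02_min_silence_ms - FRAME_MS)
```

For example, an override of 200 ms tuned against v0.2.x would become 168 ms under this model, and both close after the same number of low-probability frames.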

* feat(lib): expose `silero::VERSION` as a public constant

Out-of-tree harnesses (the parity runner being the immediate caller)
need a way to record the silero crate version they're exercising. Using
`env!("CARGO_PKG_VERSION")` in those harnesses resolves to the harness
binary's own version, not silero's. Re-exporting the version string from
the library lets callers depend on `silero::VERSION` and get the value
that actually matches the running detector logic.

Surfaced by Copilot PR review of #5
(#5 (comment)).

* fix(parity): record silero crate version + persist ffmpeg init outcome

Two issues from Copilot's PR review of #5:

1. `SILERO_CRATE_VERSION` was sourced from `env!("CARGO_PKG_VERSION")`,
   which in this binary resolves to the parity runner's own version
   (`0.0.0`) — not the silero crate version under test. The JSON output
   was misreporting `silero_crate_version`. Now uses the newly-exposed
   `silero::VERSION` constant so the JSON records the actual crate
   version being validated.
   (#5 (comment))

2. `ffmpeg_init` stored its initialisation error in a stack-local that
   was only assigned inside the `Once::call_once` closure. After the
   first failed call, subsequent invocations would silently return
   `Ok(())` because the closure no longer ran and the local was always
   `None`. Switched to a static `OnceLock<Result<(), String>>` so the
   init outcome is captured once and re-surfaced on every subsequent
   call — the function now actually behaves idempotently.
   (#5 (comment))

Verified: `cargo build --release` clean; parity smoke test on
`02_pyannote_sample` reports `silero_crate_version=0.3.0`, median IoU
1.0000.

* fix(parity): record effective max_speech_duration_s in Python output

The runner explicitly sets `max_speech_duration_s = math.inf` when
`--max-speech-s` is omitted (so the call site records the value actually
passed to `get_speech_timestamps`), but the emitted JSON was reading
`args.max_speech_s` — i.e., `null` — which contradicted the inline
comment and made the output non-self-describing.

Switched to writing `kwargs["max_speech_duration_s"]` so the JSON
records the effective value (the float when the user provided one,
otherwise `math.inf`). Python's `json.dumps` emits the latter as the
bare token `Infinity`, which strict JSON (RFC 8259) does not allow; an
inline comment notes that downstream parsers may need non-strict JSON
support if they read this field from outside the Python ecosystem.
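
The `Infinity` behaviour is easy to reproduce with the standard library alone. The snippet below is a standalone illustration rather than the runner's code; only the `max_speech_s` key is taken from the runner's output, and `strict_dumps` is a hypothetical helper.

```python
import json
import math

payload = {"max_speech_s": math.inf}

# By default json.dumps writes the non-standard token Infinity.
encoded = json.dumps(payload)

# Python's own parser accepts that token and round-trips it to float inf.
decoded = json.loads(encoded)


def strict_dumps(obj):
    """Serialise with allow_nan=False, which rejects inf and NaN values."""
    return json.dumps(obj, allow_nan=False)
```

Calling `strict_dumps(payload)` raises `ValueError`, which is one way a consumer could detect the non-finite value before handing the file to a strict parser.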

Surfaced by Copilot PR review of #5
(#5 (comment)).

Verified: parity smoke test on `02_pyannote_sample` shows
`"max_speech_s": Infinity` in the Python JSON; median IoU still 1.0000.

---------
uqio restored the test/parity branch May 3, 2026 08:06
uqio added a commit that referenced this pull request May 3, 2026
Folds the post-review additions into the unreleased 0.3.0 entry:

- ### Added: `silero::VERSION` public constant.
- ### Added: `tests/parity/` harness (was previously only described under
  Verified, not Added).
- ### Fixed: three parity-harness bugs surfaced by the Copilot PR review
  on #5 — ffmpeg-init error swallowing, parity runner reporting its own
  version instead of silero's, Python runner emitting `null` instead of
  the effective `max_speech_duration_s`.