norwytch/ASVspoof

ASVspoof 2021 Stress-Testing with Degraded Channels and Unseen Generators

Two questions about a pretrained audio deepfake detector — SSL_Anti-spoofing (wav2vec2 XLS-R 300M + AASIST) — on ASVspoof 2021 LA:

  • Part 1: where does it break? EER under real-world audio degradation, with a per-attack breakdown.
  • Part 2: why does it fail to generalize? A leave-one-attack-out study on the frozen XLS-R embedding, asking what makes a detector miss an unseen generator.

The detector is loaded fairseq-free via an exact weight remap of the published checkpoint (src/ssl_aasist.py).

Part 1 — Robustness under degradation

Re-scores the eval set under MP3 compression (8–128 kbps), telephony (300–3400 Hz bandpass + G.711 mu-law), additive noise (0–30 dB SNR), and chunked/streaming inference, plus a per-attack (A07–A19) and native-codec breakdown. Code in src/degradations.py and src/evaluate.py.
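
As a rough illustration of two of these degradations, the sketch below mixes white noise into a signal at a target SNR and runs one sample through a mu-law round trip. It is not the code in src/degradations.py: the function names are hypothetical, the noise is assumed to be white Gaussian, and the companding uses the continuous mu-law formula with 8-bit quantization rather than the segmented G.711 table that audioop implements.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power equals snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale the drawn noise so its empirical power is exactly noise_power.
    drawn_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(noise_power / drawn_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise."""
    signal_energy = sum(x * x for x in clean)
    noise_energy = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)


def mu_law_roundtrip(x, mu=255):
    """Compress one sample in [-1, 1] with mu-law, quantize to 8 bits, expand again."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    code = round((compressed + 1) / 2 * 255)   # 8-bit code, 0..255
    decoded = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(decoded) * math.log1p(mu)) / mu, decoded)


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
```

Because the noise is rescaled to its empirical power, `measured_snr_db(clean, noisy)` recovers the requested 10 dB up to floating-point error. The mu-law round trip is lossy by design, so a mid-range sample comes back close to, but not exactly at, its original value.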

Part 2 — Why it fails to generalize

A leave-one-attack-out study on the frozen embedding. The pre-registered hypothesis (H1) was that how strongly the embedding encodes a generator's identity predicts non-transfer to that held-out generator. H1 was falsified: identity is linearly decodable at every layer, so it can't explain why only some attacks fail. The replacement (H2) — non-transfer tracks how close a generator sits to the bona-fide manifold — held up. Full design, controls, and references in research-design.md.
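
The leave-one-attack-out protocol can be sketched as a split generator. This is a minimal illustration rather than the logic in experiments/loao.py: the function name is hypothetical, and the handling of bona fide trials (alternating between train and test) is an assumption made here so that both sides of each split contain real speech.

```python
def leave_one_attack_out(attack_ids, bonafide_label="bonafide"):
    """Yield (held_out, train_idx, test_idx) once per spoofing attack.

    Assumption (not from the repo): bona fide trials alternate between train
    and test, seen attacks go to train, and the held-out attack appears only
    in test.
    """
    attacks = sorted({a for a in attack_ids if a != bonafide_label})
    for held_out in attacks:
        train_idx, test_idx = [], []
        n_bona = 0
        for i, attack in enumerate(attack_ids):
            if attack == bonafide_label:
                (train_idx if n_bona % 2 == 0 else test_idx).append(i)
                n_bona += 1
            elif attack == held_out:
                test_idx.append(i)   # never seen during training
            else:
                train_idx.append(i)
        yield held_out, train_idx, test_idx


labels = ["bonafide", "A07", "A08", "bonafide", "A19", "A07"]
splits = list(leave_one_attack_out(labels))
```

A probe trained on `train_idx` and scored on `test_idx` then measures how well detection transfers to the generator it never saw; research-design.md has the actual design and controls.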

Retrieval — search over the embeddings

A nearest-neighbour layer on the same frozen embeddings. src/retrieval.py has a from-scratch random-hyperplane LSH index (with a brute-force reference and an optional FAISS backend) and three heads: a non-parametric k-NN detector, generator attribution by neighbour vote, and an open-set novelty score (distance to everything known). The novelty score is the retrieval view of Part 2 — a generator near the bona-fide manifold (A19) sits close to indexed bona fide, gets a low novelty score, and evades, which is its non-transfer made geometric. scripts/retrieval_eval.py runs the recall@k / latency benchmark (the hand-rolled LSH against FAISS), the k-NN detector EER, attribution accuracy, and per-attack open-set novelty. Audio has no lexical channel, so this is dense-only — there's no sparse/BM25 side to add.
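
For readers unfamiliar with random-hyperplane LSH, here is a minimal single-table sketch of the idea. It is not the index in src/retrieval.py: the class and method names are hypothetical, it uses plain Python lists instead of arrays, and it assumes cosine similarity as the ranking metric within a bucket.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class RandomHyperplaneLSH:
    """One hash table; each bit records which side of a random hyperplane a vector is on."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = {}
        self.vectors = []

    def _signature(self, v):
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0.0 for plane in self.planes)

    def add(self, v):
        self.buckets.setdefault(self._signature(v), []).append(len(self.vectors))
        self.vectors.append(v)

    def query(self, q, k=1):
        # Only vectors sharing the query's bucket are candidates; rank them exactly.
        candidates = self.buckets.get(self._signature(q), [])
        ranked = sorted(candidates, key=lambda i: cosine(q, self.vectors[i]), reverse=True)
        return ranked[:k]


rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
index = RandomHyperplaneLSH(dim=16, n_bits=6, seed=0)
for v in data:
    index.add(v)
```

Vectors separated by a small angle rarely fall on opposite sides of a random hyperplane, so they tend to share a signature and land in the same bucket. A single table misses true neighbours that straddle a plane, which is why practical indexes usually combine several tables and why a brute-force reference is useful for measuring recall@k.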

Key findings

  • Clean EER 0.82%, AUC 0.998 on the 148,176-trial eval set, matching the published SSL_Anti-spoofing baseline.
  • Noise, not compression, is the failure mode. MP3 costs almost nothing (EER ~0.7% at 32–128 kbps); additive noise pushes EER to 9.8% at 0 dB SNR. Streaming needs about 4 s of context (EER 2.7% at 2 s, 13.8% at 0.5 s). Native codecs all stay under 1% EER.
  • No seen-attack blind spot: every eval attack is at or below 2.6% EER.
  • H1 falsified, H2 supported. Bona-fide proximity predicts the leave-one-attack-out gap (distance-to-bona vs. gap ρ=−0.67, p=0.013). A19, the generator closest to real speech, has the largest gap (+14.8 pp); fine-tuning the encoder moves it off the bona manifold and the gap drops to +1.7 pp.
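
All of the numbers above are equal error rates. As a reminder of what that metric measures, the sketch below sweeps a decision threshold and reports the point where the false-acceptance and false-rejection rates cross. It is an illustration only, not the metric code in this repo: the function name is hypothetical, and it assumes that higher scores mean "more bona fide".

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate by threshold sweep; a trial is accepted when score >= threshold."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))  # the "reject everything" operating point
    best_gap, best_eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)  # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)       # spoof accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


separated = compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
overlapping = compute_eer([0.2, 0.6, 0.7, 0.9], [0.1, 0.3, 0.4, 0.8])
```

In the overlapping example one of four bona fide trials and one of four spoof trials end up on the wrong side of the best threshold, giving an EER of 25%. This quadratic sweep is fine for a handful of scores; an eval set of this size would call for a sort-based implementation.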

Full write-up and figures in report.md.

Caveats

  • An earlier version reported a 9.73% clean EER. That was two bugs: zero-padding instead of the recipe's repeat-padding, and a protocol-parser leak that scored 16,926 out-of-spec trials. Both are fixed; clean EER is now 0.82%. The earlier "A10 blind spot" was an artifact of those bugs (A10 is 0.55% once corrected).
  • Part 2 is correlational, n=13 attacks, one corpus. On the detector's own AASIST representation the gap nearly vanishes (A19 +0.13 pp), so the effect is a property of the frozen-SSL probe rather than the deployed model, and task-tuning removes it. Cross-dataset validation is the main next step.
  • The four detection extensions (NLP, attack profiling, reconstruction, prosody) are implemented and unit-tested but not yet run at scale.
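
The padding bug in the first caveat comes down to how short utterances are brought up to the model's fixed input length. The sketch below contrasts the two strategies on plain Python lists; the function names are hypothetical and the target length is left as a parameter, since the recipe's exact value is not restated here.

```python
def repeat_pad(samples, target_len):
    """Tile a short utterance until it fills target_len, then truncate."""
    if not samples:
        raise ValueError("cannot repeat-pad an empty utterance")
    if len(samples) >= target_len:
        return samples[:target_len]
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


def zero_pad(samples, target_len):
    """Append silence up to target_len (the behaviour behind the earlier bug)."""
    return (samples + [0.0] * target_len)[:target_len]


short = [1.0, 2.0, 3.0]
tiled = repeat_pad(short, 7)
silent = zero_pad(short, 7)
```

Here `tiled` is `[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]` while `silent` ends in four zeros. A model that only ever saw tiled audio has never encountered long runs of digital silence, which is one plausible reason the zero-padded inputs scored so differently.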

Setup

python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
brew install ffmpeg

Use Python ≤ 3.12 (the G.711 path uses stdlib audioop, removed in 3.13). Download data per data/README.md. The proposal's lab260/AASIST3 checkpoint is degenerate across every public mirror (~63% EER), which is why this uses SSL_Anti-spoofing.

Data and artifacts

The corpus is gitignored. Cached embeddings and the model weights are on Hugging Face:

hf download sempertemper/asvspoof-xlsr-embeddings asvspoof_xlsr_embeddings.tar --repo-type dataset --local-dir results/
tar -xf results/asvspoof_xlsr_embeddings.tar -C results/
hf download sempertemper/ssl-antispoofing-weights ssl_antispoofing_weights.tar --local-dir third_party/weights/
tar -xf third_party/weights/ssl_antispoofing_weights.tar -C third_party/weights/

Both also regenerate from scratch — weights from the original SSL_Anti-spoofing repo, embeddings from scripts/cache_embeddings.py.

Reproduce

# Part 1 — degradation sweep + figures
python -m src.evaluate --protocol data/asvspoof2021_LA/keys/CM/trial_metadata.txt \
                       --flac-dir data/asvspoof2021_LA/flac --full
python scripts/make_figures.py results/per_attack_eer_full.csv

# Part 2 — cache embeddings, then LOAO / H1 / H2 / Regime B
python -m scripts.cache_embeddings --subset 8000
python -m scripts.loao_per_attack --emb-dir results/embeddings --out results/loao_per_attack.csv
python -m scripts.layer_sweep_selectivity                       # H1
python -m scripts.geometry_h2                                   # H2
python -m scripts.cache_embeddings_ft --subset 8000 && python -m scripts.compare_regimes
python -m scripts.make_part2_figures

# Retrieval — ANN benchmark (LSH vs FAISS) + k-NN detector / attribution / open-set novelty
python -m scripts.retrieval_eval --emb-dir results/embeddings --layer 9

Roadmap

  1. Confound controls and layer-robustness. scripts/confound_controls.py checks whether H1 and H2 survive the codec and speaker confounds; geometry_h2.py --layer checks H2 isn't specific to layer 9. Both run on the cached embeddings.
  2. Cross-dataset generalization, the biggest upgrade. The current unseen-generator test is leave-one-out within a single 2019-era corpus; re-running it with training on 2019/2021 and testing on In-the-Wild and ASVspoof 5 would cover genuinely novel generators and more attacks.
  3. Realistic degradation: MUSAN babble and reverb instead of white noise, and a noise/codec-augmentation baseline to see whether augmenting recovers the lost robustness.
  4. Run the four detection extensions at scale.
  5. Part 2 causality: the pre-registered band-mask intervention, and whether bona-proximity can flag novel-attack risk from embedding geometry before attack samples exist.

Layout

src/          dataset, degradations, metrics, ssl_aasist loader, model, evaluate, visualize,
              + embeddings, probes (Part 2), retrieval, + the four extensions
experiments/  loao.py — leave-one-attack-out runner
scripts/      cache_embeddings[_ft], loao_per_attack, layer_sweep_selectivity, geometry_h2,
              compare_regimes, confound_controls, retrieval_eval, make_figures
tests/        pytest suite for the dep-free logic (run in CI)
data/         download instructions + attack_taxonomy.json (corpora gitignored)
results/      figures + summary CSVs (embeddings/scores gitignored; on Hugging Face)
report.md            full write-up of both parts, with figures
research-design.md   Part 2 design + verified references
