UTA-ACL2/BEATS-SED

BEATS-SED

BEATS-SED is a semi-supervised Sound Event Detection (SED) framework built on a modified ATST-SED (ICASSP 2024) codebase. The ATST encoder is replaced with BEATs and paired with three sequence modeling backends:

  • BEATs + BiGRU
  • BEATs + Conformer
  • BEATs + Zipformer

Training is fully config-driven. In most cases, you only need to edit a YAML file and launch training.


Changes from ATST-SED

The following modifications were made to the original ATST-SED implementation:

  • Replaced the ATST encoder with BEATs
  • Added BiGRU, Conformer, and Zipformer sequence modeling backends
  • Extended the YAML configuration system to cover all model variants
  • Added assertion-based validation throughout the training pipeline
  • Improved robustness to configuration and dataset mismatches
  • Simplified experiment setup through config-driven training

The overall structure follows ATST-SED, but most components have been updated or replaced.
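The assertion-based validation mentioned above can be sketched as follows. This is an illustrative example only; the field names (`data`, `strong_folder`, `training.batch_size`) are hypothetical and may not match the repo's actual config schema:

```python
# Hypothetical sketch of assertion-based config validation.
# Field names are illustrative, not the repo's actual schema.
import os

def validate_config(cfg: dict) -> None:
    """Fail fast with a descriptive error instead of crashing mid-training."""
    assert "data" in cfg, "config is missing the 'data' section"
    for key in ("strong_folder", "weak_folder", "unlabeled_folder"):
        path = cfg["data"].get(key)
        assert path is not None, f"config is missing data.{key}"
        assert os.path.isdir(path), f"data.{key} does not exist: {path}"
    batch_size = cfg.get("training", {}).get("batch_size", 0)
    assert batch_size > 0, "training.batch_size must be a positive integer"
```

Checks like these catch dataset-path typos and missing fields at startup rather than hours into a run.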


Repository Structure

BEATS-SED/
├── 1_setup_env/    # Environment setup scripts and Dockerfiles
├── desed_task/     # Dataset utilities and training pipeline
├── train/          # Training scripts and YAML configs
├── work/           # Checkpoints and logs
├── setup.py
└── README.md

Environment Setup

Two options are available:

  1. Use one of the Dockerfiles in 1_setup_env/
  2. Manually recreate the required environment

Then install the package in editable mode:

pip install -e .

Blackwell GPU Support

Separate setup scripts and Dockerfiles are provided for NVIDIA Blackwell GPUs:

1_setup_env/
├── beats_sed.sh
├── beats_sed_blackwell.sh
├── Dockerfile
└── Dockerfile.blackwell

Use beats_sed_blackwell.sh or Dockerfile.blackwell when setting up on Blackwell hardware. No code changes are required — the training pipeline is hardware-agnostic once the correct environment is active.


Configuration

Training is controlled through YAML config files located at:

train/config/beats/bigru/stage1_bigru.yaml
train/config/beats/conformer/stage1_conformer.yaml
train/config/beats/zipformer/stage1_zipformer.yaml

Before training, update the following fields in the appropriate config:

  • Dataset paths
  • Model architecture settings (number of layers, hidden sizes, etc.)
  • Batch size and number of training epochs

The pipeline includes validation checks and will raise descriptive errors for misconfigured or missing fields.
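The fields to edit typically look something like the fragment below. This is an illustrative sketch only; the actual keys in the stage1_*.yaml files may differ:

```yaml
# Illustrative fragment -- check the actual stage1_*.yaml for the real keys.
data:
  strong_folder: /path/to/strong/audio
  synth_folder: /path/to/synthetic/audio
  weak_folder: /path/to/weak/audio
  unlabeled_folder: /path/to/unlabeled/audio
net:
  n_layers: 2          # model depth
  hidden_size: 256
training:
  batch_size: 48
  n_epochs: 200
```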


Training

BEATs + BiGRU

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
  --conf_file train/config/beats/bigru/stage1_bigru.yaml \
  --gpus 0

BEATs + Conformer

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
  --conf_file train/config/beats/conformer/stage1_conformer.yaml \
  --gpus 0

BEATs + Zipformer

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
  --conf_file train/config/beats/zipformer/stage1_zipformer.yaml \
  --gpus 0

SED Training Data Setup

This framework follows the semi-supervised SED paradigm from DCASE Task 4. Training combines four data sources.

1. Strongly Labeled Data

Clips with precise event-level annotations (onset and offset timestamps).

Requirements:

  • Accurate temporal boundaries
  • Clean, consistent label taxonomy

Recommended size: 3–10% of total training audio. As little as 3–5 hours can be sufficient with high-quality annotations. Annotation precision matters more than dataset size — this source anchors the model's temporal localization.


2. Synthetic Data

Artificial mixtures created by overlaying isolated sound events onto background audio.

Requirements:

  • Generate mixtures using only strong-set events; never use validation or test events
  • Use background audio that is acoustically diverse and unrelated to target events
  • Maintain realistic signal-to-noise ratio (SNR) ranges

Recommended size: 3–10× the duration of strongly labeled data; typically 20–40% of total training audio. Synthetic data should complement real strong data, not dominate it.
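SNR-controlled overlay, as described in the requirements above, can be sketched in a few lines of NumPy. This is a simplified illustration, not the repo's actual mixing code; function and variable names are hypothetical:

```python
# Sketch of SNR-controlled mixing for synthetic SED data (illustrative only).
import numpy as np

def mix_at_snr(event: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay `event` onto `background` at a random onset, scaled to snr_db."""
    assert len(event) <= len(background)
    onset = np.random.randint(0, len(background) - len(event) + 1)
    segment = background[onset:onset + len(event)]
    event_power = np.mean(event ** 2) + 1e-12
    noise_power = np.mean(segment ** 2) + 1e-12
    # Choose gain so that 10*log10(gain^2 * event_power / noise_power) == snr_db.
    gain = np.sqrt(noise_power / event_power * 10 ** (snr_db / 10))
    mixture = background.copy()
    mixture[onset:onset + len(event)] += gain * event
    return mixture
```

Sampling `snr_db` from a realistic range (e.g. several dB either side of 0) per mixture is what keeps the synthetic set acoustically plausible.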


3. Weakly Labeled Data

Clips with clip-level presence labels only (no timestamps).

Requirements:

  • Reliable presence labels
  • Balanced positive and negative samples
  • Negative samples should be acoustically plausible (not silent or off-domain)

Recommended size: Comparable to or slightly larger than the synthetic set; typically 20–40% of total training audio. Weak labels improve representation learning and event presence detection but do not directly supervise temporal localization.


4. Unlabeled Data

Unannotated in-domain audio used for teacher–student or consistency-based learning.

Requirements:

  • Should ideally match the target deployment domain
  • No annotation required

Recommended size: The largest component; typically 30–50%+ of total training audio or 5–10× the strong dataset duration. Unlabeled data improves generalization and reduces domain mismatch.


Recommended Data Ratios

A stable configuration across many SED scenarios:

Source      Proportion
Strong      ~5%
Synthetic   ~25%
Weak        ~30%
Unlabeled   ~40%

Example: Given 5 hours of strongly labeled audio:

  • 20–30 hours synthetic
  • 20–30 hours weak
  • 30–50+ hours unlabeled

Exact ratios depend on annotation quality, domain mismatch, event frequency, and model capacity. Any data source can be disabled; best performance typically requires all four.
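The 5-hour example follows directly from the ratio table. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the recommended ratios, given 5 h of strong data.
ratios = {"strong": 0.05, "synthetic": 0.25, "weak": 0.30, "unlabeled": 0.40}
strong_hours = 5.0
total_hours = strong_hours / ratios["strong"]      # 100 h of total training audio
hours = {src: total_hours * p for src, p in ratios.items()}
# hours -> {'strong': 5.0, 'synthetic': 25.0, 'weak': 30.0, 'unlabeled': 40.0}
```

These midpoints (25 h synthetic, 30 h weak, 40 h unlabeled) fall inside the ranges listed above.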


Important Guidelines

  • Never include validation or test events in synthetic mixtures.
  • Negative weak-labeled samples must be acoustically plausible — random silence or out-of-domain noise degrades performance.
  • Maintain controlled class balance across all data sources.
  • Strong annotations must be temporally precise — noisy timestamps are more harmful than a small dataset.
  • Synthetic mixtures should simulate realistic overlap conditions (multiple simultaneous events at varying SNRs).
