BEATS-SED is a semi-supervised Sound Event Detection (SED) framework built on a modified ATST-SED (ICASSP 2024) codebase. The ATST encoder is replaced with BEATs and paired with three sequence modeling backends:
- BEATs + BiGRU
- BEATs + Conformer
- BEATs + Zipformer
Training is fully config-driven. In most cases, you only need to edit a YAML file and launch training.
The following modifications were made to the original ATST-SED implementation:
- Replaced the ATST encoder with BEATs
- Added BiGRU, Conformer, and Zipformer sequence modeling backends
- Extended the YAML configuration system to cover all model variants
- Added assertion-based validation throughout the training pipeline
- Improved robustness to configuration and dataset mismatches
- Simplified experiment setup through config-driven training
The overall structure follows ATST-SED, but most components have been updated or replaced.
```
BEATS-SED/
├── 1_setup_env/   # Environment setup scripts and Dockerfiles
├── desed_task/    # Dataset utilities and training pipeline
├── train/         # Training scripts and YAML configs
├── work/          # Checkpoints and logs
├── setup.py
└── README.md
```
Two options are available:
- Use one of the Dockerfiles in `1_setup_env/`
- Manually recreate the required environment
Then install the package in editable mode:

```bash
pip install -e .
```

Separate setup scripts and Dockerfiles are provided for NVIDIA Blackwell GPUs:
```
1_setup_env/
├── beats_sed.sh
├── beats_sed_blackwell.sh
├── Dockerfile
└── Dockerfile.blackwell
```
Use `beats_sed_blackwell.sh` or `Dockerfile.blackwell` when setting up on Blackwell hardware. No code changes are required — the training pipeline is hardware-agnostic once the correct environment is active.
Training is controlled through YAML config files located at:
```
train/config/beats/bigru/stage1_bigru.yaml
train/config/beats/conformer/stage1_conformer.yaml
train/config/beats/zipformer/stage1_zipformer.yaml
```
Before training, update the following fields in the appropriate config:
- Dataset paths
- Model depth (number of layers, hidden sizes, etc.)
- Batch size and number of training epochs
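As an illustration, the fields to edit might look like the excerpt below. The section and key names here are assumptions for the sake of example, not the repo's exact schema; check the actual YAML files for the real field names.

```yaml
# Hypothetical config excerpt (key names are illustrative)
data:
  strong_folder: /data/desed/strong       # dataset paths
  weak_folder: /data/desed/weak
  unlabeled_folder: /data/desed/unlabeled
net:
  n_layers: 2                             # model depth
  hidden_size: 256
training:
  batch_size: 48
  n_epochs: 200
```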
The pipeline includes validation checks and will raise descriptive errors for misconfigured or missing fields.
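The assertion-based validation mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the repo's actual validation code, and the field names are assumed:

```python
def validate_config(cfg: dict) -> None:
    """Raise a descriptive error for missing or invalid config fields.

    Illustrative only: section/key names are assumptions, not the
    repo's real schema.
    """
    for section in ("data", "net", "training"):
        assert section in cfg, f"Missing top-level config section: '{section}'"
    assert cfg["training"].get("batch_size", 0) > 0, \
        "training.batch_size must be a positive integer"
    assert cfg["training"].get("n_epochs", 0) > 0, \
        "training.n_epochs must be a positive integer"
```

Failing fast with a message that names the offending field makes misconfigured runs easy to diagnose before any GPU time is spent.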
```bash
# BiGRU
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/bigru/stage1_bigru.yaml \
    --gpus 0

# Conformer
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/conformer/stage1_conformer.yaml \
    --gpus 0

# Zipformer
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/zipformer/stage1_zipformer.yaml \
    --gpus 0
```

This framework follows the semi-supervised SED paradigm from DCASE Task 4. Training combines four data sources.
**Strongly labeled data.** Clips with precise event-level annotations (onset and offset timestamps).
Requirements:
- Accurate temporal boundaries
- Clean, consistent label taxonomy
Recommended size: 3–10% of total training audio. As little as 3–5 hours can be sufficient with high-quality annotations. Annotation precision matters more than dataset size — this source anchors the model's temporal localization.
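For reference, strong annotations in DCASE Task 4 are conventionally stored as a tab-separated file with one row per event; the file names below are made up:

```
filename      onset   offset  event_label
clip_0001.wav 0.50    2.30    Dog
clip_0001.wav 1.10    4.00    Speech
clip_0002.wav 3.20    3.90    Alarm_bell_ringing
```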
**Synthetic strongly labeled data.** Artificial mixtures created by overlaying isolated sound events onto background audio.
Requirements:
- Generate mixtures using only strong-set events; never use validation or test events
- Use background audio that is acoustically diverse and unrelated to target events
- Maintain realistic signal-to-noise ratio (SNR) ranges
Recommended size: 3–10× the duration of strongly labeled data; typically 20–40% of total training audio. Synthetic data should complement real strong data, not dominate it.
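Overlaying an event at a controlled SNR reduces to rescaling the event relative to the background power. A minimal NumPy sketch (not the actual synthesis pipeline used by DESED tooling):

```python
import numpy as np

def mix_at_snr(event: np.ndarray, background: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Overlay an isolated event onto background audio at a target SNR.

    The event is rescaled so that
    10 * log10(P_event / P_background) == snr_db.
    """
    p_event = np.mean(event ** 2)
    p_bg = np.mean(background ** 2)
    gain = np.sqrt(p_bg / p_event * 10.0 ** (snr_db / 10.0))
    return background + gain * event
```

In practice the SNR is drawn from a range (e.g. 0–15 dB) per mixture, and multiple events are overlaid to simulate realistic overlap.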
**Weakly labeled data.** Clips with clip-level presence labels only (no timestamps).
Requirements:
- Reliable presence labels
- Balanced positive and negative samples
- Negative samples should be acoustically plausible (not silent or off-domain)
Recommended size: Comparable to or slightly larger than the synthetic set; typically 20–40% of total training audio. Weak labels improve representation learning and event presence detection but do not directly supervise temporal localization.
**Unlabeled in-domain data.** Unannotated audio from the target domain, used for teacher–student or consistency-based learning.
Requirements:
- Should ideally match the target deployment domain
- No annotation required
Recommended size: The largest component; typically 30–50%+ of total training audio or 5–10× the strong dataset duration. Unlabeled data improves generalization and reduces domain mismatch.
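The teacher–student scheme behind the unlabeled branch is typically a mean-teacher setup: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency loss ties the student's predictions on unlabeled clips to the teacher's. A simplified NumPy illustration, not the repo's actual implementation:

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha: float = 0.999):
    """Mean-teacher update: teacher <- alpha * teacher + (1 - alpha) * student.

    The teacher is never trained by gradient descent; it only tracks
    the student, and its predictions on unlabeled clips serve as
    consistency targets.
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

def consistency_loss(student_pred: np.ndarray,
                     teacher_pred: np.ndarray) -> float:
    """MSE between student and teacher predictions on unlabeled audio."""
    return float(np.mean((student_pred - teacher_pred) ** 2))
```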
A stable configuration across many SED scenarios:
| Source | Proportion |
|---|---|
| Strong | ~5% |
| Synthetic | ~25% |
| Weak | ~30% |
| Unlabeled | ~40% |
Example: Given 5 hours of strongly labeled audio:
- 20–30 hours synthetic
- 20–30 hours weak
- 30–50+ hours unlabeled
Exact ratios depend on annotation quality, domain mismatch, event frequency, and model capacity. Any data source can be disabled; best performance typically requires all four.
- Never include validation or test events in synthetic mixtures.
- Negative weak-labeled samples must be acoustically plausible — random silence or out-of-domain noise degrades performance.
- Maintain controlled class balance across all data sources.
- Strong annotations must be temporally precise — noisy timestamps are more harmful than a small dataset.
- Synthetic mixtures should simulate realistic overlap conditions (multiple simultaneous events at varying SNRs).