BEATS-SED is a semi-supervised Sound Event Detection (SED) framework built on a modified ATST-SED (ICASSP 2024) codebase. The ATST encoder is replaced with BEATs and paired with three sequence modeling backends:
- BEATs + BiGRU
- BEATs + Conformer
- BEATs + Zipformer
Training is fully config-driven. In most cases, you only need to edit a YAML file and launch training.
The following modifications were made to the original ATST-SED implementation:
- Replaced the ATST encoder with BEATs
- Added BiGRU, Conformer, and Zipformer sequence modeling backends
- Extended the YAML configuration system to cover all model variants
- Added assertion-based validation throughout the training pipeline
- Improved robustness to configuration and dataset mismatches
- Simplified experiment setup through config-driven training
The overall structure follows ATST-SED, but most components have been updated or replaced.
```
BEATS-SED/
├── 1_setup_env/   # Environment setup scripts and Dockerfiles
├── desed_task/    # Dataset utilities and training pipeline
├── train/         # Training scripts and YAML configs
├── work/          # Checkpoints and logs
├── setup.py
└── README.md
```
Two options are available:
- Use one of the Dockerfiles in `1_setup_env/`
- Manually recreate the required environment
Then install the package in editable mode:

```bash
pip install -e .
```

Separate setup scripts and Dockerfiles are provided for NVIDIA Blackwell GPUs:
```
1_setup_env/
├── beats_sed.sh
├── beats_sed_blackwell.sh
├── Dockerfile
└── Dockerfile.blackwell
```
Use `beats_sed_blackwell.sh` or `Dockerfile.blackwell` when setting up on Blackwell hardware. No code changes are required — the training pipeline is hardware-agnostic once the correct environment is active.
Training is controlled through YAML config files located at:
```
train/config/beats/bigru/stage1_bigru.yaml
train/config/beats/conformer/stage1_conformer.yaml
train/config/beats/zipformer/stage1_zipformer.yaml
```
Before training, update the following fields in the appropriate config:
- Dataset paths
- Model depth (number of layers, hidden sizes, etc.)
- Batch size and number of training epochs
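As an illustration, the fields to edit might look like the excerpt below. The section and key names here are assumptions for the sake of example, not the repo's exact schema; check the actual YAML files for the real field names.

```yaml
# Hypothetical config excerpt (key names are illustrative)
data:
  strong_folder: /data/desed/strong       # dataset paths
  weak_folder: /data/desed/weak
  unlabeled_folder: /data/desed/unlabeled
net:
  n_layers: 2                             # model depth
  hidden_size: 256
training:
  batch_size: 48
  n_epochs: 200
```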
The pipeline includes validation checks and will raise descriptive errors for misconfigured or missing fields.
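The assertion-based validation mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the repo's actual validation code, and the field names are assumed:

```python
def validate_config(cfg: dict) -> None:
    """Raise a descriptive error for missing or invalid config fields.

    Illustrative only: section/key names are assumptions, not the
    repo's real schema.
    """
    for section in ("data", "net", "training"):
        assert section in cfg, f"Missing top-level config section: '{section}'"
    assert cfg["training"].get("batch_size", 0) > 0, \
        "training.batch_size must be a positive integer"
    assert cfg["training"].get("n_epochs", 0) > 0, \
        "training.n_epochs must be a positive integer"
```

Failing fast with a message that names the offending field makes misconfigured runs easy to diagnose before any GPU time is spent.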
```bash
# BiGRU
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/bigru/stage1_bigru.yaml \
    --gpus 0

# Conformer
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/conformer/stage1_conformer.yaml \
    --gpus 0

# Zipformer
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 -u train/train_stage1.py \
    --conf_file train/config/beats/zipformer/stage1_zipformer.yaml \
    --gpus 0
```

This framework follows the semi-supervised SED paradigm from DCASE Task 4. Training combines four data sources.
**Strongly labeled data.** Clips with precise event-level annotations (onset and offset timestamps).
Requirements:
- Accurate temporal boundaries
- Clean, consistent label taxonomy
Recommended size: 3–10% of total training audio. As little as 3–5 hours can be sufficient with high-quality annotations. Annotation precision matters more than dataset size — this source anchors the model's temporal localization.
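For reference, strong annotations in DCASE Task 4 are conventionally stored as a tab-separated file with one row per event; the file names below are made up:

```
filename      onset   offset  event_label
clip_0001.wav 0.50    2.30    Dog
clip_0001.wav 1.10    4.00    Speech
clip_0002.wav 3.20    3.90    Alarm_bell_ringing
```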
**Synthetic strongly labeled data.** Artificial mixtures created by overlaying isolated sound events onto background audio.
Requirements:
- Generate mixtures using only strong-set events; never use validation or test events
- Use background audio that is acoustically diverse and unrelated to target events
- Maintain realistic signal-to-noise ratio (SNR) ranges
Recommended size: 3–10× the duration of strongly labeled data; typically 20–40% of total training audio. Synthetic data should complement real strong data, not dominate it.
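Overlaying an event at a controlled SNR reduces to rescaling the event relative to the background power. A minimal NumPy sketch (not the actual synthesis pipeline used by DESED tooling):

```python
import numpy as np

def mix_at_snr(event: np.ndarray, background: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Overlay an isolated event onto background audio at a target SNR.

    The event is rescaled so that
    10 * log10(P_event / P_background) == snr_db.
    """
    p_event = np.mean(event ** 2)
    p_bg = np.mean(background ** 2)
    gain = np.sqrt(p_bg / p_event * 10.0 ** (snr_db / 10.0))
    return background + gain * event
```

In practice the SNR is drawn from a range (e.g. 0–15 dB) per mixture, and multiple events are overlaid to simulate realistic overlap.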
**Weakly labeled data.** Clips with clip-level presence labels only (no timestamps).
Requirements:
- Reliable presence labels
- Balanced positive and negative samples
- Negative samples should be acoustically plausible (not silent or off-domain)
Recommended size: Comparable to or slightly larger than the synthetic set; typically 20–40% of total training audio. Weak labels improve representation learning and event presence detection but do not directly supervise temporal localization.
**Unlabeled in-domain data.** Unannotated audio from the target domain, used for teacher–student or consistency-based learning.
Requirements:
- Should ideally match the target deployment domain
- No annotation required
Recommended size: The largest component; typically 30–50%+ of total training audio or 5–10× the strong dataset duration. Unlabeled data improves generalization and reduces domain mismatch.
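The teacher–student scheme behind the unlabeled branch is typically a mean-teacher setup: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency loss ties the student's predictions on unlabeled clips to the teacher's. A simplified NumPy illustration, not the repo's actual implementation:

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha: float = 0.999):
    """Mean-teacher update: teacher <- alpha * teacher + (1 - alpha) * student.

    The teacher is never trained by gradient descent; it only tracks
    the student, and its predictions on unlabeled clips serve as
    consistency targets.
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

def consistency_loss(student_pred: np.ndarray,
                     teacher_pred: np.ndarray) -> float:
    """MSE between student and teacher predictions on unlabeled audio."""
    return float(np.mean((student_pred - teacher_pred) ** 2))
```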
A stable configuration across many SED scenarios:
| Source | Proportion |
|---|---|
| Strong | ~5% |
| Synthetic | ~25% |
| Weak | ~30% |
| Unlabeled | ~40% |
Example: Given 5 hours of strongly labeled audio:
- 20–30 hours synthetic
- 20–30 hours weak
- 30–50+ hours unlabeled
Exact ratios depend on annotation quality, domain mismatch, event frequency, and model capacity. Any data source can be disabled; best performance typically requires all four.
- Never include validation or test events in synthetic mixtures.
- Negative weak-labeled samples must be acoustically plausible — random silence or out-of-domain noise degrades performance.
- Maintain controlled class balance across all data sources.
- Strong annotations must be temporally precise — noisy timestamps are more harmful than a small dataset.
- Synthetic mixtures should simulate realistic overlap conditions (multiple simultaneous events at varying SNRs).