Audio-Visual Software Architecture

This repository provides the top-level orchestration and documentation of a modular Audio-Visual Sensor Fusion Software Architecture developed for reproducible research on multi-modal detection, localization, tracking, and evaluation of moving speakers in indoor environments.

The repository serves as an umbrella project that integrates all software modules of the pipeline as Git submodules, alongside experimental data and thesis documentation.

Overview

The complete processing pipeline consists of the following stages:

Visual Scene Simulation (Unity)
Room Acoustics Simulation (gpuRIR)
Video Detector (CNN-based object detection)
Audio Detector (3D positional sound source localization)
Audio-Visual Sensor Fusion (multi-object tracking)
Tracking Evaluation (HOTA / TrackEval)

Each stage is implemented as an independent software module and linked through standardized JSONL interfaces.

Software Modules

Refer to the README of each Git submodule for details.

Visual Scene Simulation (Unity)

Submodule: visualsimulationunity

Generates synthetic RGB and depth images
Simulates fisheye cameras and dynamic human motion
Exports time-stamped 3D ground-truth speaker positions

Room Acoustics Simulation

Submodule: gpuRIR

GPU-accelerated image source method (ISM)
Generates room impulse responses (RIRs)
Renders multi-channel microphone signals for moving speakers

Video Detector

Submodule: cnnvideodetektor

CNN-based video object detection
Processes RGB image streams
Outputs 3D video detections
Note: This module is not fully published due to third-party intellectual property restrictions.

Audio Detector (3D-SSL)

Submodule: ssl4ips

Closed-form analytical 3D positional sound source localization
Uses β-GCC-PHAT for TDOA estimation
Outputs audio localization detections in world coordinates
Note: This module is not fully published due to third-party intellectual property restrictions.
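
Because the ssl4ips module is not fully published, the following is only a generic sketch of how a β-weighted GCC-PHAT delay estimate between two microphone signals can be computed; it is not the module's implementation. The function name `gcc_phat_beta`, the default `beta=0.8`, and the synthetic test signal are assumptions made for illustration.

```python
import numpy as np


def gcc_phat_beta(x, y, fs, beta=0.8, max_tau=None):
    """Estimate the delay of y relative to x (in seconds).

    beta = 1 corresponds to classic PHAT weighting,
    beta = 0 to plain cross-correlation.
    """
    n = len(x) + len(y)  # zero-pad so the correlation is linear, not circular
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)
    # Partial whitening: divide by |cross|^beta (epsilon avoids division by zero).
    cross /= np.abs(cross) ** beta + 1e-12
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs


# Synthetic check: y is x delayed by 12 samples (hypothetical test signal).
fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(4096)
delay = 12
x = sig
y = np.concatenate((np.zeros(delay), sig[:-delay]))
tau = gcc_phat_beta(x, y, fs, beta=0.8)
print(f"estimated delay: {tau * fs:.1f} samples")
```

Multiplying the estimated delay by the speed of sound gives the range difference between the two microphones, which is the quantity a TDOA-based localizer works with.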

Audio-Visual Sensor Fusion

Submodule: audiovisualsensorfusion

Implements MS-GLMB multi-object tracking
Fuses audio and video detections
Handles track birth, death, and uncertainty

Tracking Evaluation

Submodule: trackeval

Uses TrackEval HOTA reference implementation
Compares predicted tracks with the ground truth
Reports detection, association, and localization metrics

Data Flow and Interfaces

All modules communicate through well-defined JSONL interfaces:

groundtruth_sources.jsonl
audio_localizations.jsonl
video_localizations.jsonl
fusion_tracks.jsonl

This design allows:

Modular replacement of algorithms
Independent evaluation of components
Reproducible experimentation
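
To illustrate the interface idea, here is a minimal sketch of writing and reading a JSONL file (one JSON object per line) using only the Python standard library. The record fields `timestamp`, `source_id`, and `position` are hypothetical; the actual schema is defined in the submodule READMEs.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Yield one dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical detection records; field names are assumptions for illustration.
detections = [
    {"timestamp": 0.00, "source_id": 1, "position": [1.2, 0.8, 1.6]},
    {"timestamp": 0.04, "source_id": 1, "position": [1.3, 0.8, 1.6]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "audio_localizations.jsonl"
    write_jsonl(path, detections)
    loaded = list(read_jsonl(path))

print(loaded == detections)
```

Because every line is an independent JSON object, a downstream module can stream a file record by record, and a detector can be swapped out as long as its replacement emits the same fields.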

Generated Experiment (Scenario) Folder Structure

Each simulation session follows a standardized directory structure shared across all software modules:

📁 experiment_001/
├── config.json
├── groundtruth_sources.jsonl
├── audio/
│   └── wav/
│       └── multichannel_audio_<timestamp>.wav
├── video/
│   ├── rgb/
│   │   └── RGB_frame_<timestamp>.png
│   └── depth/
│       └── Depth_frame_<timestamp>.png
├── localization/
│   ├── audio_localizations.jsonl
│   └── video_localizations.jsonl
└── tracking/
    ├── audio_tracking.jsonl
    ├── video_tracking.jsonl
    └── audio_video_tracking.jsonl
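
The layout above can be checked programmatically before a module is run. The sketch below is one possible way to do this with `pathlib`; the helper name `missing_entries` is hypothetical, and treating every listed file as required is an assumption (the tracking files, for example, presumably only exist after the fusion stage has run).

```python
import tempfile
from pathlib import Path

# Directory and file names taken from the tree above.
EXPECTED_DIRS = [
    "audio/wav",
    "video/rgb",
    "video/depth",
    "localization",
    "tracking",
]
EXPECTED_FILES = [
    "config.json",
    "groundtruth_sources.jsonl",
    "localization/audio_localizations.jsonl",
    "localization/video_localizations.jsonl",
    "tracking/audio_tracking.jsonl",
    "tracking/video_tracking.jsonl",
    "tracking/audio_video_tracking.jsonl",
]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Demo: create the directory skeleton plus config.json, then list what is missing.
with tempfile.TemporaryDirectory() as tmp:
    exp = Path(tmp) / "experiment_001"
    for d in EXPECTED_DIRS:
        (exp / d).mkdir(parents=True)
    (exp / "config.json").write_text("{}", encoding="utf-8")
    still_missing = missing_entries(exp)

print(still_missing)
```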

Master’s Thesis Results

The following table summarizes the combined tracking results over all scenarios, as reported in the master’s thesis.

Modality	HOTA ↑	DetA ↑	AssA ↑	DetRe ↑	DetPr ↑	AssRe ↑	AssPr ↑	LocA ↑
M1 (Audio)	26.32	18.08	38.40	18.26	93.59	38.92	89.60	92.66
M2 (Video)	68.54	68.61	68.49	70.40	93.10	70.17	93.74	91.83
M3 (Audio + Video)	70.29	69.38	71.24	71.18	93.39	72.88	93.97	92.04

The results show that, for the MOT25A, MOT25V, and MOT25AV datasets, audio-visual fusion (M3) outperforms both unimodal trackers, most clearly in association accuracy (AssA 71.24 vs. 68.49 for video) and in overall HOTA (70.29 vs. 68.54 for video).
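
As a reading aid for the table, recall that at a single localization threshold HOTA is the geometric mean of DetA and AssA. The reported scores are averaged over a range of thresholds, so the identity holds only approximately for the table values. The short check below, using only the numbers from the table, illustrates this; the helper name `geometric_mean` is chosen here for illustration.

```python
import math

# (DetA, AssA, HOTA) copied from the results table.
results = {
    "M1 (Audio)": (18.08, 38.40, 26.32),
    "M2 (Video)": (68.61, 68.49, 68.54),
    "M3 (Audio + Video)": (69.38, 71.24, 70.29),
}


def geometric_mean(det_a, ass_a):
    """Per-threshold HOTA definition: sqrt(DetA * AssA)."""
    return math.sqrt(det_a * ass_a)


deviations = {}
for name, (det_a, ass_a, hota) in results.items():
    approx = geometric_mean(det_a, ass_a)
    deviations[name] = abs(approx - hota)
    print(f"{name}: sqrt(DetA*AssA) = {approx:.2f}, reported HOTA = {hota:.2f}")
```

The small deviations are expected, because the mean of per-threshold geometric means is not the same as the geometric mean of the averaged DetA and AssA. The check also makes clear why the audio-only HOTA is low: its DetA of 18.08 pulls the score down even though its AssA is moderate.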

Publication

If you use this code, please cite the following papers:

@INPROCEEDINGS{2025_Sillekens_3DSSL_SRP,
  author={L. Sillekens and O. Rudolf and M. Thißen and I. Penner and S. Seyfarth and E. Hergenröther and J.-P. Akelbein},
  booktitle={2025 10th International Conference on Frontiers of Signal Processing (ICFSP)},
  title={A Non-invasive Measurement System for Evaluating 3D Indoor Sound Source Localization Techniques},
  year={2025},
  pages={79-86},
  keywords={3D positional sound source localization, indoor microphone array geometry,
            beta-gcc-phat, high reverberation time, speaker localization},
  url={http://www.doi.org/10.1109/ICFSP67350.2025.11353692}
}

@article{2025_Oskar_Rudolf,
  title={Implementation of visual people counting algorithms in embedded systems},
  author={O. Rudolf and R. Hecker and M. Thi{\ss}en and L. Sillekens and I. Penner and J.-P. Akelbein and S. Seyfarth and E. Hergenr{\"o}ther},
  journal={Computer Science Research Notes},
  year={2025},
  url={http://www.doi.org/10.24132/CSRN.2025-4}
}

Contact

Created by: Laurens Sillekens
