An end-to-end pipeline for analyzing children's speech from mixed adult-child recordings, using a fine-tuned ASR model and NLP-based linguistic analysis.
Standard Automatic Speech Recognition (ASR) models struggle with children's atypical and unclear pronunciation patterns. This project addresses that gap by building an audio processing pipeline with speaker assignment and linguistic analysis — and fine-tuning a Swedish Whisper model to better handle children's speech.
- ASR — kb-whisper-large for speech-to-text transcription
- Speaker Assignment — Logistic regression model to separate child and adult speech segments
- Linguistic Analysis — Stanza and spaCy for NLP-based lexical analysis
- LoRA Fine-tuning — Low-Rank Adaptation to fine-tune kb-whisper-large on limited children's speech data within Colab's memory constraints, specializing the model for children's speech
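To illustrate why LoRA makes fine-tuning feasible under Colab's memory limits: instead of updating a full `d × d` weight matrix `W`, LoRA trains two low-rank matrices and applies `W + (alpha/r) · B @ A`. The dimensions below are illustrative (Whisper-large uses a hidden size of 1280 in its attention projections; the rank `r = 8` is a hypothetical choice, not necessarily this project's config):

```python
# Illustrative only: the trainable-parameter savings from LoRA.
# A full d x d projection is frozen; only A (r x d) and B (d x r) are trained.
d = 1280   # hidden size of Whisper-large attention projections
r = 8      # LoRA rank (hypothetical value for illustration)

full_params = d * d       # parameters updated by full fine-tuning
lora_params = 2 * d * r   # parameters updated by LoRA (A and B)

print(f"full fine-tuning: {full_params:,} params")   # 1,638,400
print(f"LoRA:             {lora_params:,} params")   # 20,480
print(f"reduction:        {full_params // lora_params}x")  # 80x
```

This roughly 80x reduction per adapted matrix is what lets a large Whisper checkpoint be specialized on a small dataset within a single Colab GPU.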
```
├── data/
│   ├── train/          # Training data
│   └── test/           # Test data
├── models/             # Saved model weights (not tracked in git)
├── data_loader.py      # Data loading utilities
├── functions.py        # Helper functions
├── lr_train.py         # Logistic regression training
└── main.py             # Full pipeline: data loading, model inference, optional LR training
```
Place data in `data/train/` and `data/test/`.
Note: This data is used for the logistic regression speaker assignment model only — separate from the dataset used to fine-tune the Whisper ASR model.
Each dataset split contains:
- Multiple `.wav` audio files (mixed adult-child recordings)
- A `.csv` file with transcriptions in the format: `[filename], [transcribed text]`
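A transcript file in that format can be read with the standard library. This is a minimal sketch; the helper name `load_transcripts` is hypothetical, and the project's actual loading logic lives in `data_loader.py`:

```python
import csv

def load_transcripts(csv_path):
    """Map each audio filename to its reference transcription.
    Expects rows of the form: filename, transcribed text."""
    transcripts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip malformed or empty lines
            filename, text = row[0].strip(), row[1].strip()
            transcripts[filename] = text
    return transcripts
```

Each entry can then be paired with the matching `.wav` file in the same split directory.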
```
python main.py
```

`main.py` handles the full pipeline — loading data, loading models, and running inference. Logistic regression training can be enabled and configured via parameters inside `main.py`.
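The speaker-assignment idea behind the logistic regression step can be sketched as a binary classifier labeling a segment as child (1) or adult (0). Everything below is illustrative: the single feature (mean pitch, since children's voices typically sit at a higher fundamental frequency), the training loop, and the data are all hypothetical stand-ins for the real features and code in `lr_train.py`:

```python
import math

def train_logreg(xs, ys, lr=0.01, epochs=2000):
    """Fit a one-feature logistic regression by gradient descent on log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict_child(w, b, x):
    """Return True if the segment is classified as child speech."""
    return 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5

# Hypothetical mean-pitch values: adults ~100-150 Hz, children ~250-350 Hz.
pitches = [110, 130, 145, 260, 300, 340]
labels  = [0,   0,   0,   1,   1,   1]
# Scale the feature so gradient descent converges with a fixed step size.
scaled = [p / 100.0 for p in pitches]
w, b = train_logreg(scaled, labels)
```

A real model would use a richer acoustic feature vector per segment, but the decision rule is the same sigmoid threshold.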
The LoRA fine-tuned Whisper model was trained separately on Google Colab and is loaded from a private Hugging Face repository.
Reduced WER from 0.23 → 0.157 through data cleaning, text postprocessing (jiwer), and hyperparameter tuning.
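For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The project computes it with jiwer; the pure-Python function below is only an illustrative sketch of the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("hej jag heter anna", "hej jag heter hanna"))  # 0.25
```

jiwer additionally applies text normalization (lowercasing, punctuation removal, etc.) before scoring, which is why the postprocessing step above matters for the reported numbers.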
- Base model — performs better on longer audio files
- LoRA model — higher accuracy on shorter audio files, but more prone to hallucination
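The trade-off above suggests a simple routing strategy: send short clips to the LoRA model and long ones to the base model. This sketch is hypothetical; the 10-second threshold and function name are invented for illustration, not tuned or part of the project:

```python
# Hypothetical duration-based model routing for the trade-off above.
SHORT_CLIP_SECONDS = 10.0  # illustrative cutoff, not tuned

def choose_model(duration_s: float) -> str:
    """Pick which Whisper variant to run based on clip length."""
    return "lora" if duration_s < SHORT_CLIP_SECONDS else "base"
```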
🔒 Note: Data used in this project is proprietary and not publicly available.