An end-to-end pipeline for analyzing children's speech from mixed adult-child recordings, using a fine-tuned ASR model and NLP-based linguistic analysis.
Standard Automatic Speech Recognition (ASR) models struggle with children's atypical and unclear pronunciation patterns. This project addresses that gap by building an audio processing pipeline with speaker assignment and linguistic analysis — and fine-tuning a Swedish Whisper model to better handle children's speech.
- ASR — kb-whisper-large for speech-to-text transcription
- Speaker Assignment — Logistic regression model to separate child and adult speech segments
- Linguistic Analysis — Stanza and spaCy for NLP-based lexical analysis
- LoRA Fine-tuning — Low-Rank Adaptation to fine-tune kb-whisper-large on limited children's speech data within Colab's memory constraints, specializing the model for children's speech
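To illustrate why LoRA makes fine-tuning feasible under Colab's memory limits: instead of updating a full `d × d` weight matrix `W`, LoRA trains two low-rank matrices and applies `W + (alpha/r) · B @ A`. The dimensions below are illustrative (Whisper-large uses a hidden size of 1280 in its attention projections; the rank `r = 8` is a hypothetical choice, not necessarily this project's config):

```python
# Illustrative only: the trainable-parameter savings from LoRA.
# A full d x d projection is frozen; only A (r x d) and B (d x r) are trained.
d = 1280   # hidden size of Whisper-large attention projections
r = 8      # LoRA rank (hypothetical value for illustration)

full_params = d * d       # parameters updated by full fine-tuning
lora_params = 2 * d * r   # parameters updated by LoRA (A and B)

print(f"full fine-tuning: {full_params:,} params")   # 1,638,400
print(f"LoRA:             {lora_params:,} params")   # 20,480
print(f"reduction:        {full_params // lora_params}x")  # 80x
```

This roughly 80x reduction per adapted matrix is what lets a large Whisper checkpoint be specialized on a small dataset within a single Colab GPU.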
```
├── data/
│   ├── train/          # Training data
│   └── test/           # Test data
├── models/             # Saved model weights (not tracked in git)
├── data_loader.py      # Data loading utilities
├── functions.py        # Helper functions
├── lr_train.py         # Logistic regression training
└── main.py             # Full pipeline: data loading, model inference, optional LR training
```
Place data in `data/train/` and `data/test/`.
Note: This data is used for the logistic regression speaker assignment model only — separate from the dataset used to fine-tune the Whisper ASR model.
Each dataset split contains:
- Multiple `.wav` audio files (mixed adult-child recordings)
- A `.csv` file with transcriptions in the format: `[filename], [transcribed text]`
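A transcript file in that format can be read with the standard library. This is a minimal sketch; the helper name `load_transcripts` is hypothetical, and the project's actual loading logic lives in `data_loader.py`:

```python
import csv

def load_transcripts(csv_path):
    """Map each audio filename to its reference transcription.
    Expects rows of the form: filename, transcribed text."""
    transcripts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip malformed or empty lines
            filename, text = row[0].strip(), row[1].strip()
            transcripts[filename] = text
    return transcripts
```

Each entry can then be paired with the matching `.wav` file in the same split directory.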
```
python main.py
```

`main.py` handles the full pipeline — loading data, loading models, and running inference. Logistic regression training can be enabled and configured via parameters inside `main.py`.
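The speaker-assignment idea behind the logistic regression step can be sketched as a binary classifier labeling a segment as child (1) or adult (0). Everything below is illustrative: the single feature (mean pitch, since children's voices typically sit at a higher fundamental frequency), the training loop, and the data are all hypothetical stand-ins for the real features and code in `lr_train.py`:

```python
import math

def train_logreg(xs, ys, lr=0.01, epochs=2000):
    """Fit a one-feature logistic regression by gradient descent on log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict_child(w, b, x):
    """Return True if the segment is classified as child speech."""
    return 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5

# Hypothetical mean-pitch values: adults ~100-150 Hz, children ~250-350 Hz.
pitches = [110, 130, 145, 260, 300, 340]
labels  = [0,   0,   0,   1,   1,   1]
# Scale the feature so gradient descent converges with a fixed step size.
scaled = [p / 100.0 for p in pitches]
w, b = train_logreg(scaled, labels)
```

A real model would use a richer acoustic feature vector per segment, but the decision rule is the same sigmoid threshold.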
The LoRA fine-tuned Whisper model was trained separately on Google Colab and is loaded from a private Hugging Face repository.
Reduced WER from 0.23 → 0.157 through data cleaning, text postprocessing (jiwer), and hyperparameter tuning.
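For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The project computes it with jiwer; the pure-Python function below is only an illustrative sketch of the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("hej jag heter anna", "hej jag heter hanna"))  # 0.25
```

jiwer additionally applies text normalization (lowercasing, punctuation removal, etc.) before scoring, which is why the postprocessing step above matters for the reported numbers.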
- Base model — performs better on longer audio files
- LoRA model — higher accuracy on shorter audio files, but more prone to hallucination
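The trade-off above suggests a simple routing strategy: send short clips to the LoRA model and long ones to the base model. This sketch is hypothetical; the 10-second threshold and function name are invented for illustration, not tuned or part of the project:

```python
# Hypothetical duration-based model routing for the trade-off above.
SHORT_CLIP_SECONDS = 10.0  # illustrative cutoff, not tuned

def choose_model(duration_s: float) -> str:
    """Pick which Whisper variant to run based on clip length."""
    return "lora" if duration_s < SHORT_CLIP_SECONDS else "base"
```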
🔒 Note: Data used in this project is proprietary and not publicly available.