Authors: Daria Alexandrova, Kamilya Shakirova
A systematic investigation into how decoding strategies (temperature, top‑k, top‑p) affect the quality, diversity, and prompt adherence of music generated by Meta's MusicGen model. This repository includes a complete pipeline for fine‑tuning on custom audio, running controlled parameter sweeps, evaluating with automated metrics (FAD, CLAP, repetition, diversity), and conducting blind human listening studies.
This project extends the original AudioCraft MusicGen model by:
- Fine‑tuning on an out‑of‑distribution genre — the ULTRAKILL industrial breakcore soundtrack — using LoRA adapters to keep training tractable on consumer hardware.
- Sweeping decoding parameters independently to avoid confounding effects: temperature (0.5–1.5), top‑k (50–500), top‑p (0.85–0.99), and greedy argmax.
- Measuring four complementary metrics for each condition:
  - FAD (Fréchet Audio Distance) — overall audio quality.
  - CLAP similarity — prompt adherence.
  - Repetition score & loop ratio — intra‑clip looping behaviour.
  - Diversity score — spread of samples from the same prompt.
- Validating findings with a human study using a custom web‑based pairwise comparison interface.
The main conclusion: temperature = 1.0 (unmodified sampling) provides the best balance of quality and prompt adherence, while greedy decoding collapses into repetitive noise and should be avoided.
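For reference, all three knobs act on the model's next-token logits before a token is drawn. A minimal numpy sketch of the idea (illustrative only; AudioCraft's actual implementation differs in detail):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Illustrative sketch of temperature / top-k / top-p sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0.0:           # greedy argmax, as in the "greedy" condition
        return int(np.argmax(logits))

    logits = logits / temperature    # temperature rescales the model's confidence
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # token indices, most probable first
    if top_k > 0:
        order = order[:top_k]        # keep only the k most probable tokens
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, top_p)) + 1  # smallest nucleus covering top_p mass
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum()  # renormalise over the kept tokens
    return int(rng.choice(order, p=kept))
```

Greedy decoding corresponds to `temperature=0.0` (or `top_k=1`), which explains the collapse: the model deterministically revisits its single most likely continuation.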
| Metric | Description | Direction |
|---|---|---|
| FAD | Fréchet distance between VGGish embeddings of generated and reference audio. | ↓ lower = better |
| CLAP similarity | Cosine similarity between CLAP text and audio embeddings. | ↑ higher = better prompt match |
| Repetition score | Mean off‑diagonal cosine similarity of mel‑spectrogram frames. | ↓ lower = less looping |
| Diversity score | Mean pairwise L2 distance between mean‑mel embeddings of samples. | ↑ higher = more varied outputs |
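The last two metrics are simple enough to sketch directly. Assuming mel-spectrogram frames as row vectors (names and shapes here are illustrative, not the repository's exact implementation):

```python
import numpy as np

def repetition_score(mel):
    """Mean off-diagonal cosine similarity between mel frames.

    mel: array of shape (n_frames, n_mels). High values mean frames
    keep resembling each other, i.e. the clip is looping.
    """
    x = mel / (np.linalg.norm(mel, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T                           # (n_frames, n_frames) cosine similarities
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop the trivial self-similarity diagonal
    return float(off_diag.mean())

def diversity_score(embeddings):
    """Mean pairwise L2 distance between per-sample mean-mel embeddings."""
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(embeddings)
             for b in embeddings[i + 1:]]
    return float(np.mean(dists))
```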
```
MusicGen/
├── config/
│   └── default.yaml              # Central configuration for sweeps & evaluation
├── data/
│   ├── ultrakill/                # Reference audio files
│   ├── ultrakill_manifests/      # Train/valid/test JSONL manifests
│   │   ├── train/data.jsonl
│   │   ├── valid/data.jsonl
│   │   └── test/data.jsonl
│   └── ultrakill_prompts.jsonl   # Evaluation prompts
├── results/
│   ├── audit/                    # Greedy decoding invariance audit
│   ├── sweep_ultrakill/          # Generated audio & manifest.json
│   ├── metrics_ultrakill/        # Per-condition metric JSONs & summary.json
│   └── human_study/              # Pairwise comparison tasks & responses
├── src/
│   ├── data/
│   │   ├── local_dataset.py      # Build manifests from local audio folder
│   │   └── pipeline.py           # FMA dataset download & manifest creation
│   ├── human_study/
│   │   ├── generate_pairs.py     # Generate pairwise tasks from manifest
│   │   └── viewer.html           # Web interface for blind listening tests
│   ├── metrics/
│   │   ├── evaluate_all.py       # Run all metrics for a sweep
│   │   ├── prompt_adherence.py   # CLAP-based prompt similarity
│   │   ├── repetition.py         # Repetition & diversity scores
│   │   └── analysis.py           # Correlation heatmap & analysis
│   ├── audit_decoding.py         # Verify greedy temperature-invariance
│   ├── evaluate.py               # FAD computation wrapper
│   ├── generate.py               # Single-shot generation script
│   ├── run_experiments.py        # Controlled sweep over decoding params
│   └── train.py                  # Fine-tune MusicGen (LoRA / layer-wise)
├── requirements.txt
└── README.md
```
Python 3.11 is required. AudioCraft has strict dependency constraints; follow the two‑step install carefully.
```
# Install base dependencies
pip install -r requirements.txt

# Install AudioCraft separately (avoids torch/xformers conflicts)
pip install --no-deps "audiocraft @ git+https://github.com/facebookresearch/audiocraft.git"
```

For LoRA fine‑tuning, also install `peft`:

```
pip install peft
```

> **Note:** MusicGen‑small requires ~8 GB VRAM for a batch size of 2 with 30‑second clips. Reduce `--duration` or use `--batch_size 1` if memory‑constrained.
Place your audio files (.flac, .wav, .mp3, etc.) and optional .txt sidecars in a single folder. If a sidecar is missing, a default label is used.
```
python -m src.data.local_dataset \
    --input_dir ./my_dataset \
    --output_dir ./data/custom_manifests \
    --default_label "ULTRAKILL OST, industrial metal breakcore, Heaven Pierce Her" \
    --split 0.8 0.1 0.1
```

This creates `train/`, `valid/`, and `test/` folders, each containing a `data.jsonl` manifest.
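The per-file logic is straightforward to sketch: pair each audio file with its `.txt` sidecar, fall back to the default label, and emit JSON Lines. Field names below are illustrative, not necessarily the exact manifest schema:

```python
import json
from pathlib import Path

AUDIO_EXTS = {".flac", ".wav", ".mp3"}

def build_entries(input_dir, default_label):
    """Pair each audio file with its .txt sidecar label (or the default)."""
    entries = []
    for audio in sorted(Path(input_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        sidecar = audio.with_suffix(".txt")
        label = sidecar.read_text().strip() if sidecar.exists() else default_label
        # Field names here are illustrative; the real manifests follow an
        # AudioCraft-style schema (path, duration, sample_rate, ...).
        entries.append({"path": str(audio), "description": label})
    return entries

def write_manifest(entries, out_path):
    # JSON Lines: one record per line.
    with open(out_path, "w") as f:
        for e in entries:
            f.write(json.dumps(e) + "\n")
```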
LoRA fine‑tuning is well suited to small datasets (rank = 16, alpha = 32). On an RTX 4060, 3 epochs over 30‑second clips take roughly 7 hours.
```
python -m src.train \
    --manifest_dir ./data/ultrakill_manifests \
    --audio_dir ./my_dataset \
    --output_dir ./trained_model \
    --lora \
    --epochs 5 \
    --batch_size 2 \
    --duration 30 \
    --lr 1e-4
```

Without `--lora`, only the last 4 transformer layers are unfrozen.
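For intuition about what `--lora` trains: LoRA freezes each adapted weight matrix `W` and learns a low-rank update, so the effective weight becomes `W + (alpha / r) * B @ A`. A numpy sketch of the arithmetic (the real training adapts MusicGen's transformer via `peft`; shapes and initialisation here are illustrative):

```python
import numpy as np

d, k, r, alpha = 64, 64, 16, 32     # rank=16, alpha=32 as used above

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))         # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init: adapter starts as a no-op

def lora_forward(x):
    # Base path plus scaled low-rank update; only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because `B` starts at zero, the adapted model is exactly the pretrained model at step 0, and only `r * (d + k)` extra parameters are trained per adapted matrix, which is what keeps this tractable on consumer hardware.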
Generate audio under different sampling strategies.
```
# Temperature sweep only
python -m src.run_experiments --group temp --num_samples 5

# Full sweep across all conditions
python -m src.run_experiments --group all --output_dir ./results/sweep_ultrakill

# Supply custom prompts
python -m src.run_experiments --group temp --prompts_file ./data/eval_prompts.jsonl
```

Available groups: `greedy`, `temp`, `topk`, `topp`, `all`.
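The "controlled" part of the sweep means each group varies exactly one knob while the others stay at base values. A Python sketch of such a grid (the base values and intermediate grid points are assumptions; only the endpoint ranges come from this README):

```python
# Assumed base decoding configuration; each group overrides one knob.
BASE = {"temperature": 1.0, "top_k": 250, "top_p": 0.0}

GROUPS = {
    "greedy": [{"use_sampling": False}],
    "temp":   [{"temperature": t} for t in (0.5, 0.75, 1.0, 1.25, 1.5)],
    "topk":   [{"top_k": k} for k in (50, 100, 250, 500)],
    "topp":   [{"top_p": p} for p in (0.85, 0.9, 0.95, 0.99)],
}

def conditions(group):
    """Expand a group name into full decoding configs, one knob varied at a time."""
    return [{**BASE, **override} for override in GROUPS[group]]
```

Varying one knob at a time is what lets the metrics be attributed to a single parameter rather than to a confounded combination.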
Compute FAD (requires reference audio), CLAP prompt adherence, repetition, and diversity for every condition.
```
python -m src.metrics.evaluate_all \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --reference ./data/ultrakill \
    --output_dir ./results/metrics_ultrakill \
    --device cuda

# Skip FAD (no reference audio) or CLAP (no GPU / slow)
python -m src.metrics.evaluate_all \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --reference ./data/ultrakill \
    --skip_fad --output_dir ./results/metrics_ultrakill
```

Skip FAD or CLAP with `--skip_fad` / `--skip_clap`.
Step 1 — Generate pairwise tasks:
```
python -m src.human_study.generate_pairs \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --conditions greedy temp_1.0 temp_1.5 topk_250 topp_0.95 \
    --pairs_per_prompt 3 \
    --output_file ./results/human_study/pairs.json \
    --project_root .   # makes audio paths relative for web serving
```

Step 2 — Serve the web interface:

```
cd /path/to/MusicGen
python -m http.server 8080
```

Open `http://localhost:8080/src/human_study/viewer.html` in a browser, load `pairs.json`, and start rating. Responses can be exported as CSV.
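After exporting responses, a natural next step is computing per-condition win rates from the pairwise choices. A hypothetical aggregation sketch (the tuple layout is an assumption, not the viewer's exact CSV schema):

```python
from collections import Counter

def win_rates(responses):
    """Compute per-condition win rates.

    responses: iterable of (condition_a, condition_b, winner) tuples,
    where winner is either condition_a or condition_b.
    """
    wins, appearances = Counter(), Counter()
    for a, b, winner in responses:
        appearances[a] += 1
        appearances[b] += 1
        wins[winner] += 1
    # Fraction of comparisons each condition appeared in that it won.
    return {c: wins[c] / appearances[c] for c in appearances}
```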
Generate a correlation heatmap from the summary metrics:
```
python -m src.metrics.analysis --metrics_path ./results/metrics_ultrakill/summary.json
```
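Under the hood, this analysis amounts to a correlation matrix over per-condition metric values. A minimal numpy sketch (the script itself may plot with matplotlib; function and field names here are illustrative):

```python
import numpy as np

def correlation_matrix(metrics):
    """Pearson correlations between metrics across conditions.

    metrics: dict mapping metric name -> list of per-condition values,
    all lists aligned on the same condition order.
    """
    names = sorted(metrics)
    data = np.array([metrics[n] for n in names])  # shape (n_metrics, n_conditions)
    return names, np.corrcoef(data)               # symmetric matrix, 1.0 on the diagonal
```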