Authors: Daria Alexandrova, Kamilya Shakirova
A systematic investigation into how decoding strategies (temperature, top‑k, top‑p) affect the quality, diversity, and prompt adherence of music generated by Meta's MusicGen model. This repository includes a complete pipeline for fine‑tuning on custom audio, running controlled parameter sweeps, evaluating with automated metrics (FAD, CLAP, repetition, diversity), and conducting blind human listening studies.
This project extends the original AudioCraft MusicGen model by:
- Fine‑tuning on an out‑of‑distribution genre — the ULTRAKILL industrial breakcore soundtrack — using LoRA adapters to keep training tractable on consumer hardware.
- Sweeping decoding parameters independently to avoid confounding effects: temperature (0.5–1.5), top‑k (50–500), top‑p (0.85–0.99), and greedy argmax.
- Measuring four complementary metrics for each condition:
  - FAD (Fréchet Audio Distance) — overall audio quality.
  - CLAP similarity — prompt adherence.
  - Repetition score & loop ratio — intra‑clip looping behaviour.
  - Diversity score — spread of samples from the same prompt.
- Validating findings with a human study using a custom web‑based pairwise comparison interface.
The main conclusion: temperature = 1.0 (unmodified sampling) provides the best balance of quality and prompt adherence, while greedy decoding collapses into repetitive noise and should be avoided.
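For reference, all three knobs act on the model's next-token logits before a token is drawn. A minimal numpy sketch of the idea (illustrative only; AudioCraft's actual implementation differs in detail):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Illustrative sketch of temperature / top-k / top-p sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0.0:           # greedy argmax, as in the "greedy" condition
        return int(np.argmax(logits))

    logits = logits / temperature    # temperature rescales the model's confidence
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # token indices, most probable first
    if top_k > 0:
        order = order[:top_k]        # keep only the k most probable tokens
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, top_p)) + 1  # smallest nucleus covering top_p mass
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum()  # renormalise over the kept tokens
    return int(rng.choice(order, p=kept))
```

Greedy decoding corresponds to `temperature=0.0` (or `top_k=1`), which explains the collapse: the model deterministically revisits its single most likely continuation.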
| Metric | Description | Direction |
|---|---|---|
| FAD | Fréchet distance between VGGish embeddings of generated and reference audio. | ↓ lower = better |
| CLAP similarity | Cosine similarity between CLAP text and audio embeddings. | ↑ higher = better prompt match |
| Repetition score | Mean off‑diagonal cosine similarity of mel‑spectrogram frames. | ↓ lower = less looping |
| Diversity score | Mean pairwise L2 distance between mean‑mel embeddings of samples. | ↑ higher = more varied outputs |
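The last two metrics are simple enough to sketch directly. Assuming mel-spectrogram frames as row vectors (names and shapes here are illustrative, not the repository's exact implementation):

```python
import numpy as np

def repetition_score(mel):
    """Mean off-diagonal cosine similarity between mel frames.

    mel: array of shape (n_frames, n_mels). High values mean frames
    keep resembling each other, i.e. the clip is looping.
    """
    x = mel / (np.linalg.norm(mel, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T                           # (n_frames, n_frames) cosine similarities
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop the trivial self-similarity diagonal
    return float(off_diag.mean())

def diversity_score(embeddings):
    """Mean pairwise L2 distance between per-sample mean-mel embeddings."""
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(embeddings)
             for b in embeddings[i + 1:]]
    return float(np.mean(dists))
```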
```
MusicGen/
├── config/
│   └── default.yaml              # Central configuration for sweeps & evaluation
├── data/
│   ├── ultrakill/                # Reference audio files
│   ├── ultrakill_manifests/      # Train/valid/test JSONL manifests
│   │   ├── train/data.jsonl
│   │   ├── valid/data.jsonl
│   │   └── test/data.jsonl
│   └── ultrakill_prompts.jsonl   # Evaluation prompts
├── results/
│   ├── audit/                    # Greedy decoding invariance audit
│   ├── sweep_ultrakill/          # Generated audio & manifest.json
│   ├── metrics_ultrakill/        # Per-condition metric JSONs & summary.json
│   └── human_study/              # Pairwise comparison tasks & responses
├── src/
│   ├── data/
│   │   ├── local_dataset.py      # Build manifests from local audio folder
│   │   └── pipeline.py           # FMA dataset download & manifest creation
│   ├── human_study/
│   │   ├── generate_pairs.py     # Generate pairwise tasks from manifest
│   │   └── viewer.html           # Web interface for blind listening tests
│   ├── metrics/
│   │   ├── evaluate_all.py       # Run all metrics for a sweep
│   │   ├── prompt_adherence.py   # CLAP-based prompt similarity
│   │   ├── repetition.py         # Repetition & diversity scores
│   │   └── analysis.py           # Correlation heatmap & analysis
│   ├── audit_decoding.py         # Verify greedy temperature-invariance
│   ├── evaluate.py               # FAD computation wrapper
│   ├── generate.py               # Single-shot generation script
│   ├── run_experiments.py        # Controlled sweep over decoding params
│   └── train.py                  # Fine-tune MusicGen (LoRA / layer-wise)
├── requirements.txt
└── README.md
```
Python 3.11 is required. AudioCraft has strict dependency constraints; follow the two‑step install carefully.
```
# Install base dependencies
pip install -r requirements.txt

# Install AudioCraft separately (avoids torch/xformers conflicts)
pip install --no-deps "audiocraft @ git+https://github.com/facebookresearch/audiocraft.git"
```

For LoRA fine‑tuning, also install `peft`:

```
pip install peft
```

> **Note:** MusicGen‑small requires ~8 GB VRAM for a batch size of 2 with 30‑second clips. Reduce `--duration` or use `--batch_size 1` if memory‑constrained.
Place your audio files (.flac, .wav, .mp3, etc.) and optional .txt sidecars in a single folder. If a sidecar is missing, a default label is used.
```
python -m src.data.local_dataset \
    --input_dir ./my_dataset \
    --output_dir ./data/custom_manifests \
    --default_label "ULTRAKILL OST, industrial metal breakcore, Heaven Pierce Her" \
    --split 0.8 0.1 0.1
```

This creates `train/`, `valid/`, and `test/` folders, each containing a `data.jsonl` manifest.
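The per-file logic is straightforward to sketch: pair each audio file with its `.txt` sidecar, fall back to the default label, and emit JSON Lines. Field names below are illustrative, not necessarily the exact manifest schema:

```python
import json
from pathlib import Path

AUDIO_EXTS = {".flac", ".wav", ".mp3"}

def build_entries(input_dir, default_label):
    """Pair each audio file with its .txt sidecar label (or the default)."""
    entries = []
    for audio in sorted(Path(input_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        sidecar = audio.with_suffix(".txt")
        label = sidecar.read_text().strip() if sidecar.exists() else default_label
        # Field names here are illustrative; the real manifests follow an
        # AudioCraft-style schema (path, duration, sample_rate, ...).
        entries.append({"path": str(audio), "description": label})
    return entries

def write_manifest(entries, out_path):
    # JSON Lines: one record per line.
    with open(out_path, "w") as f:
        for e in entries:
            f.write(json.dumps(e) + "\n")
```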
LoRA fine‑tuning is well suited to small datasets (rank = 16, alpha = 32). On an RTX 4060, 3 epochs over 30‑second clips take roughly 7 hours.
```
python -m src.train \
    --manifest_dir ./data/ultrakill_manifests \
    --audio_dir ./my_dataset \
    --output_dir ./trained_model \
    --lora \
    --epochs 5 \
    --batch_size 2 \
    --duration 30 \
    --lr 1e-4
```

Without `--lora`, only the last 4 transformer layers are unfrozen.
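For intuition about what `--lora` trains: LoRA freezes each adapted weight matrix `W` and learns a low-rank update, so the effective weight becomes `W + (alpha / r) * B @ A`. A numpy sketch of the arithmetic (the real training adapts MusicGen's transformer via `peft`; shapes and initialisation here are illustrative):

```python
import numpy as np

d, k, r, alpha = 64, 64, 16, 32     # rank=16, alpha=32 as used above

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))         # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init: adapter starts as a no-op

def lora_forward(x):
    # Base path plus scaled low-rank update; only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because `B` starts at zero, the adapted model is exactly the pretrained model at step 0, and only `r * (d + k)` extra parameters are trained per adapted matrix, which is what keeps this tractable on consumer hardware.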
Generate audio under different sampling strategies.
```
# Temperature sweep only
python -m src.run_experiments --group temp --num_samples 5

# Full sweep across all conditions
python -m src.run_experiments --group all --output_dir ./results/sweep_ultrakill

# Supply custom prompts
python -m src.run_experiments --group temp --prompts_file ./data/eval_prompts.jsonl
```

Available groups: `greedy`, `temp`, `topk`, `topp`, `all`.
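The "controlled" part of the sweep means each group varies exactly one knob while the others stay at base values. A Python sketch of such a grid (the base values and intermediate grid points are assumptions; only the endpoint ranges come from this README):

```python
# Assumed base decoding configuration; each group overrides one knob.
BASE = {"temperature": 1.0, "top_k": 250, "top_p": 0.0}

GROUPS = {
    "greedy": [{"use_sampling": False}],
    "temp":   [{"temperature": t} for t in (0.5, 0.75, 1.0, 1.25, 1.5)],
    "topk":   [{"top_k": k} for k in (50, 100, 250, 500)],
    "topp":   [{"top_p": p} for p in (0.85, 0.9, 0.95, 0.99)],
}

def conditions(group):
    """Expand a group name into full decoding configs, one knob varied at a time."""
    return [{**BASE, **override} for override in GROUPS[group]]
```

Varying one knob at a time is what lets the metrics be attributed to a single parameter rather than to a confounded combination.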
Compute FAD (requires reference audio), CLAP prompt adherence, repetition, and diversity for every condition.
```
python -m src.metrics.evaluate_all \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --reference ./data/ultrakill \
    --output_dir ./results/metrics_ultrakill \
    --device cuda

# Skip FAD (no reference audio) or CLAP (no GPU / slow)
python -m src.metrics.evaluate_all \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --reference ./data/ultrakill \
    --skip_fad --output_dir ./results/metrics_ultrakill
```

Skip FAD or CLAP with `--skip_fad` / `--skip_clap`.
Step 1 — Generate pairwise tasks:
```
python -m src.human_study.generate_pairs \
    --manifest ./results/sweep_ultrakill/manifest.json \
    --conditions greedy temp_1.0 temp_1.5 topk_250 topp_0.95 \
    --pairs_per_prompt 3 \
    --output_file ./results/human_study/pairs.json \
    --project_root .   # makes audio paths relative for web serving
```

Step 2 — Serve the web interface:

```
cd /path/to/MusicGen
python -m http.server 8080
```

Open `http://localhost:8080/src/human_study/viewer.html` in a browser, load `pairs.json`, and start rating. Responses can be exported as CSV.
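After exporting responses, a natural next step is computing per-condition win rates from the pairwise choices. A hypothetical aggregation sketch (the tuple layout is an assumption, not the viewer's exact CSV schema):

```python
from collections import Counter

def win_rates(responses):
    """Compute per-condition win rates.

    responses: iterable of (condition_a, condition_b, winner) tuples,
    where winner is either condition_a or condition_b.
    """
    wins, appearances = Counter(), Counter()
    for a, b, winner in responses:
        appearances[a] += 1
        appearances[b] += 1
        wins[winner] += 1
    # Fraction of comparisons each condition appeared in that it won.
    return {c: wins[c] / appearances[c] for c in appearances}
```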
Generate a correlation heatmap from the summary metrics:
```
python -m src.metrics.analysis --metrics_path ./results/metrics_ultrakill/summary.json
```
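Under the hood, this analysis amounts to a correlation matrix over per-condition metric values. A minimal numpy sketch (the script itself may plot with matplotlib; function and field names here are illustrative):

```python
import numpy as np

def correlation_matrix(metrics):
    """Pearson correlations between metrics across conditions.

    metrics: dict mapping metric name -> list of per-condition values,
    all lists aligned on the same condition order.
    """
    names = sorted(metrics)
    data = np.array([metrics[n] for n in names])  # shape (n_metrics, n_conditions)
    return names, np.corrcoef(data)               # symmetric matrix, 1.0 on the diagonal
```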