
HarmonyNet

HarmonyNet is an end-to-end AI pipeline that converts solo piano audio (MP3/WAV/FLAC) into readable sheet music (PDF). The project is intentionally scoped to solo piano to reduce transcription ambiguity.


Version 2 - SFT Encoder-Decoder (Current, POC)

V2 is a proof of concept. It replaces the V1 backend (an off-the-shelf basic-pitch CNN plus rule-based quantization) with a fine-tuned encoder-decoder Transformer. The model is trained on MAESTRO v3 and outputs MIDI token sequences directly from audio spectrograms.

Architecture

| Component | Details |
| --- | --- |
| Encoder | Whisper (tiny/base) pre-trained audio encoder; Mel spectrogram → hidden states |
| Decoder | Custom 4-layer causal Transformer that cross-attends to the encoder output |
| Input | 80-bin Mel spectrogram, 10s chunks at 16 kHz (Whisper format: [1, 80, 3000]) |
| Output | MIDI token sequence → NoteEvents → MusicXML → PDF |
| Parameters | ~37M encoder (Whisper tiny) + ~10M decoder (trainable in Phase A) |
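
A minimal sketch of how these pieces might be wired together, assuming the HuggingFace transformers Whisper encoder and PyTorch's nn.TransformerDecoder. Names and hyperparameters here are illustrative; the actual implementation is PianoTranscriptionModel in src/v2/model.py.

```python
# Sketch of the encoder-decoder wiring (assumed names; see src/v2/model.py).
import torch
import torch.nn as nn
from transformers import WhisperModel

VOCAB_SIZE, D_MODEL = 379, 384   # Whisper tiny hidden size is 384

class PianoTranscriptionSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)  # decoder positional embeddings omitted for brevity
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: [B, 80, 3000] Whisper log-Mel chunk; tokens: [B, T] previous MIDI tokens
        memory = self.encoder(mel).last_hidden_state                       # [B, 1500, 384]
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(self.token_emb(tokens), memory, tgt_mask=causal)
        return self.lm_head(hidden)                                        # [B, T, 379]
```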

Token Vocabulary

| Token range | Meaning |
| --- | --- |
| 0 | <BOS> - start of sequence |
| 1 | <EOS> - end of sequence |
| 2 | <PAD> - padding |
| 3–90 | NOTE_ON for MIDI pitches 21–108 (piano range) |
| 91–178 | NOTE_OFF for MIDI pitches 21–108 |
| 179–378 | TIME_SHIFT - 200 steps × 50ms = up to 10s of timing |

Total vocabulary size: 379 tokens.
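
For illustration, the note-to-token mapping might look like the following. These helpers are hypothetical (the real ones are encode_notes() and decode_tokens() in src/v2/tokenizer.py), and whether a TIME_SHIFT token counts steps from zero or one is a detail of that implementation.

```python
# Illustrative note-to-token mapping (hypothetical helpers, see src/v2/tokenizer.py).
BOS, EOS, PAD = 0, 1, 2
NOTE_ON_BASE, NOTE_OFF_BASE, TIME_SHIFT_BASE = 3, 91, 179
MIN_PITCH, STEP_MS = 21, 50

def note_on(pitch: int) -> int:        # MIDI 21-108 -> tokens 3-90
    return NOTE_ON_BASE + (pitch - MIN_PITCH)

def note_off(pitch: int) -> int:       # MIDI 21-108 -> tokens 91-178
    return NOTE_OFF_BASE + (pitch - MIN_PITCH)

def time_shift(ms: int) -> int:        # 50 ms steps -> tokens 179-378
    return TIME_SHIFT_BASE + min(ms // STEP_MS, 199)

# Middle C (MIDI 60) held for 500 ms:
tokens = [BOS, note_on(60), time_shift(500), note_off(60), EOS]
```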

Supervised Fine-Tuning (SFT) - Two-Phase Training

Training uses teacher forcing (ground-truth previous tokens fed as decoder input) across two phases:

Phase A - Decoder only (encoder frozen)

  • Encoder weights locked; only decoder (~10M params) trained
  • Learning rate: 1e-4 for 10 epochs
  • Rationale: Whisper's pre-trained representations are already high quality. Training the decoder first gives it time to learn the token vocabulary before disturbing the encoder.

Phase B - Joint fine-tuning

  • Encoder unfrozen; trained jointly with lower LR (1e-5 encoder, 1e-4 decoder)
  • 5 additional epochs
  • Rationale: After the decoder converges, subtle encoder adaptation to piano-specific spectral patterns improves note boundary precision.
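
A minimal sketch of this two-phase schedule, reusing the hypothetical model class from the architecture sketch above (attribute names are assumptions; the real trainer is src/v2/train.py):

```python
# Sketch of the two-phase parameter schedule (see src/v2/train.py for the real trainer).
import torch

model = PianoTranscriptionSketch()

# Phase A: freeze the Whisper encoder, train only the decoder stack at 1e-4
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer_a = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Phase B: unfreeze the encoder and train jointly with per-group learning rates
for p in model.encoder.parameters():
    p.requires_grad = True
decoder_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]
optimizer_b = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": decoder_params, "lr": 1e-4},
])
```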

Training

  • Dataset: MAESTRO v3.0.0 (Yamaha Disklavier recordings with aligned MIDI ground truth)
  • Subset used: 50 pieces (~8 hours of audio)
  • Segments: 10-second overlapping windows; spectrogram [1, 80, 3000], token sequence up to 512 tokens
  • Loss: Cross-entropy on next-token predictions under teacher forcing (see the sketch after the results table)
  • Hardware: Apple M-series (MPS backend), batch size 8
| Phase | Epochs | Train Loss | Val Loss |
| --- | --- | --- | --- |
| A (frozen encoder) | 10 | ~2.4 | ~2.7 |
| B (joint) | 5 | ~2.3 | ~2.71 |
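
The per-batch update itself is a standard teacher-forced cross-entropy step; a minimal sketch, continuing the model sketch above (the full loop with checkpointing and MPS handling lives in src/v2/train.py):

```python
# One teacher-forced update (sketch; real loop in src/v2/train.py).
import torch.nn.functional as F

PAD = 2

def training_step(model, optimizer, mel, tokens):
    # tokens: [B, T] ground truth, <BOS> ... <EOS>, padded with <PAD>
    decoder_input = tokens[:, :-1]        # model is fed the ground-truth prefix
    target = tokens[:, 1:]                # and must predict the next token
    logits = model(mel, decoder_input)    # [B, T-1, 379]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target.reshape(-1), ignore_index=PAD)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```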

Evaluation

Note-level Precision / Recall / F1 using the mir_eval standard:

  • A predicted note is a True Positive if it matches a reference note on the same pitch with onset within 50ms
  • Each reference note can only be matched once (greedy matching by onset proximity)
python -m src.v2.evaluate --checkpoint models/v2/best_model.pt --max-segments 20 --split validation

POC caveat: The model was trained on a 50-piece subset. F1 scores reflect early-stage training and will improve with more data and training compute. The architecture is sound; scaling data and compute is the primary path to production quality.
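
For reference, a minimal sketch of the greedy matching rule described above. This is an illustration only, assuming simple lists of (onset, pitch) pairs; src/v2/evaluate.py and mir_eval provide the actual scoring.

```python
# Greedy onset matching sketch (illustration only).
def note_prf(ref, pred, onset_tol=0.05):
    """ref/pred: lists of (onset_seconds, midi_pitch). Each reference note matches at most once."""
    matched, tp = set(), 0
    for onset, pitch in pred:
        best, best_dist = None, onset_tol
        for i, (r_onset, r_pitch) in enumerate(ref):
            if i in matched or r_pitch != pitch:
                continue
            if abs(onset - r_onset) <= best_dist:
                best, best_dist = i, abs(onset - r_onset)
        if best is not None:        # true positive: same pitch, onset within 50 ms
            matched.add(best)
            tp += 1
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```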

V2 CLI Usage

# Transcribe with V2 model
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 -o output_v2.pdf

# With a specific checkpoint
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --checkpoint models/v2/best_model.pt

# MusicXML only (no PDF)
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --no-pdf

V2 File Map

src/v2/
├── model.py             # PianoTranscriptionModel (Whisper encoder + causal decoder)
├── tokenizer.py         # Token vocabulary, encode_notes(), decode_tokens()
├── spectrogram.py       # WhisperSpectrogramExtractor → [1, 80, 3000]
├── dataset.py           # MAESTRODataset, segment pipeline, teacher-forcing batches
├── train.py             # Two-phase SFT Trainer (Phase A + Phase B)
├── transcribe.py        # Inference bridge: audio → chunks → NoteEvents → TranscriptionResult
└── evaluate.py          # Note-level P/R/F1 evaluation against MAESTRO MIDI ground truth
models/v2/
└── best_model.pt        # Best checkpoint (val_loss=2.71, 50 pieces, 15 epochs)

Known V2 Limitations (POC)

  • Small training set: 50 pieces is far below the full MAESTRO dataset (~1,200 pieces). F1 will improve with scale.
  • O(n²) inference: no KV cache; nn.TransformerDecoder re-runs full self-attention over all past tokens at every step, so decoding a sequence of length n costs O(n²) overall. Inference is slow for long sequences (see the sketch after this list).
  • Single clef output: No grand staff splitting (treble only).
  • Token budget: max_gen_tokens=128 per 10s chunk trades recall for speed (dense passages may be truncated).
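
A sketch of the current greedy decoding loop, continuing the hypothetical model sketch above, which shows where the quadratic cost comes from:

```python
# Every step re-runs attention over the whole prefix (no KV cache).
import torch

BOS, EOS = 0, 1

@torch.no_grad()
def greedy_decode(model, mel, max_gen_tokens=128):
    tokens = torch.tensor([[BOS]])
    for _ in range(max_gen_tokens):
        logits = model(mel, tokens)                         # full prefix recomputed each step
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == EOS:
            break
    return tokens[0, 1:]                                     # drop <BOS>
```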

Version 1 - Baseline System (Complete)

V1 builds a working end-to-end pipeline using basic-pitch (Spotify's ICASSP 2022 model) as the transcription backend. No custom ML training required.

HarmonyNet V1 pipeline:

Audio → basic-pitch CNN → NoteEvents → Quantizer → MusicXML → PDF

  • Accurate pitch detection across the full 88-key piano range (MIDI 21–108)
  • Correct onset timing and note durations
  • Configurable tempo, time signature, and detection thresholds
  • Rest detection is weak; pedal/sustain is not modeled
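
The pipeline above maps roughly onto the stock basic-pitch and music21 APIs. The sketch below is only an approximation: HarmonyNet runs the same ICASSP 2022 model via ONNX Runtime and applies its own quantizer (see src/inference.py), and the duration/offset handling here is deliberately crude.

```python
# Rough V1-style flow using stock basic-pitch + music21 (approximation only).
from basic_pitch.inference import predict
from music21 import meter, note, stream, tempo

BPM = 72
model_output, midi_data, note_events = predict(
    "data/inputs/fur_elise.mp3", onset_threshold=0.5, frame_threshold=0.3)

score = stream.Stream()
score.append(tempo.MetronomeMark(number=BPM))
score.append(meter.TimeSignature("3/8"))
for start_s, end_s, pitch, amplitude, _ in note_events:
    n = note.Note()
    n.pitch.midi = pitch                                        # MIDI 21-108
    n.quarterLength = max((end_s - start_s) * BPM / 60, 0.25)   # crude duration quantization
    score.insert(start_s * BPM / 60, n)                         # offset in quarter notes
score.write("musicxml", fp="fur_elise.musicxml")
```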

V1 CLI Usage

# Generate sheet music PDF
python -m src.cli transcribe input.mp3 -o output.pdf

# With custom tempo and time signature
python -m src.cli transcribe data/inputs/Gymnopedie.mp3 -o data/outputs/gymnopedie.pdf --tempo 54 --time-sig 3/4

# MusicXML only
python -m src.cli transcribe input.mp3 --no-pdf

| Option | Default | Description |
| --- | --- | --- |
| --tempo | 120 | Tempo in BPM |
| --time-sig | 4/4 | Time signature |
| --onset-threshold | 0.5 | Onset detection sensitivity (0–1) |
| --frame-threshold | 0.3 | Note frame sensitivity (0–1) |
| --title | filename | Score title |
| --no-pdf | false | Output MusicXML only |
| --keep-musicxml | false | Keep MusicXML alongside the PDF |

V1 Technical Notes

  • ONNX Runtime is used instead of TensorFlow for Python 3.12 compatibility. See docs/inference_guide.md.
  • basic-pitch (ICASSP 2022 model) provides the CNN that produces onset, note, and contour predictions from Harmonic CQT spectrograms.
  • music21 handles MusicXML encoding. MuseScore handles PDF rendering.
  • A scipy compatibility shim patches scipy.signal.gaussian for scipy 1.14+ (see src/inference.py).
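
The shim is most likely just an alias for the relocated window function; a sketch of the idea (the actual patch is in src/inference.py):

```python
# scipy 1.14 removed scipy.signal.gaussian; alias the relocated window function.
import scipy.signal

if not hasattr(scipy.signal, "gaussian"):
    scipy.signal.gaussian = scipy.signal.windows.gaussian
```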

Web API

The V1 pipeline is exposed as a REST API built on FastAPI + Celery + Redis (Upstash), with file storage on Cloudflare R2 and optional OpenAI musical analysis.

Client  →  FastAPI  →  Celery task queue  →  Worker
                           ↑                    ↓
                        Redis (Upstash)     V1 pipeline → R2 storage
                                                ↓
                                        OpenAI GPT-4o-mini (optional)
| Component | Role |
| --- | --- |
| FastAPI | Accepts audio uploads, dispatches jobs, serves status/download endpoints |
| Celery | Runs transcription in the background so the HTTP request returns immediately |
| Redis (Upstash) | Broker and result store for Celery; supports rediss:// SSL |
| Cloudflare R2 | Stores completed PDF and MusicXML outputs, served via presigned URLs |
| OpenAI GPT-4o-mini | Optional; identifies the piece, estimates difficulty, gives a practice tip |

Endpoints:

| Method | Path | Description |
| --- | --- | --- |
| POST | /transcribe | Upload audio; returns job_id |
| GET | /status/{job_id} | Poll job state (pending → processing → done) |
| GET | /result/{job_id}/pdf | Redirect to a presigned R2 PDF URL |
| GET | /result/{job_id}/musicxml | Redirect to a presigned R2 MusicXML URL |
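
A minimal sketch of how the upload-and-dispatch flow fits together with FastAPI and Celery. Handler and task names here are hypothetical; the real app is api/main.py and api/celery_app.py, per the run commands below.

```python
# Sketch of upload → Celery dispatch → status polling (hypothetical names).
import os
from celery import Celery
from fastapi import FastAPI, UploadFile

celery_app = Celery("harmonynet",
                    broker=os.environ["UPSTASH_REDIS_URL"],
                    backend=os.environ["UPSTASH_REDIS_URL"])
app = FastAPI()

@celery_app.task
def transcribe_audio(path: str) -> dict:
    # run the V1 pipeline, upload PDF/MusicXML to R2, return presigned URLs
    ...

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())
    job = transcribe_audio.delay(path)       # returns immediately with a job id
    return {"job_id": job.id}

@app.get("/status/{job_id}")
def status(job_id: str):
    return {"state": celery_app.AsyncResult(job_id).state}
```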

Running locally:

# Terminal 1 — API server
uvicorn api.main:app --reload

# Terminal 2 — Celery worker
celery -A api.celery_app.celery worker --loglevel=info

Environment variables:

| Variable | Required | Description |
| --- | --- | --- |
| UPSTASH_REDIS_URL | Yes (or REDIS_URL) | Redis broker/backend (rediss:// for Upstash SSL) |
| REDIS_URL | Fallback | Plain Redis URL for local dev |
| R2_ACCOUNT_ID | Yes | Cloudflare R2 account |
| R2_ACCESS_KEY_ID | Yes | R2 credentials |
| R2_SECRET_ACCESS_KEY | Yes | R2 credentials |
| R2_BUCKET | Yes | R2 bucket name |
| OPENAI_API_KEY | No | Enables GPT-4o-mini analysis; the pipeline works without it |
| FRONTEND_URL | No | CORS origin (default: http://localhost:3000) |

Setup and Requirements

Python 3.12+ required.

git clone https://github.com/NishevithaV/HarmonyNet.git
cd HarmonyNet
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Check all dependencies:

python -m src.cli check

MuseScore (optional, for PDF rendering)

Install MuseScore 4 from: https://musescore.org/en/download

  • macOS: /Applications/MuseScore 4.app/Contents/MacOS/mscore
  • Linux: /usr/bin/mscore or /usr/local/bin/mscore4
  • Windows: C:\Program Files\MuseScore 4\bin\MuseScore4.exe

If MuseScore is not installed, the pipeline still produces MusicXML output that can be opened in any notation software.
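
If PDF rendering cannot find MuseScore automatically, music21 can be pointed at the binary explicitly. A sketch, using the macOS path above (HarmonyNet may already handle this detection internally):

```python
# Point music21 at the MuseScore binary (adjust the path for Linux/Windows).
from music21 import environment

settings = environment.UserSettings()
settings["musicxmlPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
settings["musescoreDirectPNGPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
```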

V2 Model Checkpoint

V2 requires a trained checkpoint at models/v2/best_model.pt. It will auto-download from HuggingFace the first time you run --model v2.

Checkpoint hosted at: https://huggingface.co/nishevithav/harmonynet-v2
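
If the auto-download is unavailable, the checkpoint can also be fetched manually with huggingface_hub. A sketch, assuming the file in the HF repo is named best_model.pt:

```python
# Manual checkpoint download sketch (filename is an assumption).
import os, shutil
from huggingface_hub import hf_hub_download

os.makedirs("models/v2", exist_ok=True)
cached = hf_hub_download(repo_id="nishevithav/harmonynet-v2", filename="best_model.pt")
shutil.copy(cached, "models/v2/best_model.pt")
```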

To train from scratch (requires MAESTRO v3 audio data):

python -m src.v2.train

Training on 50 pieces takes ~2–3 hours on an Apple M-series chip.


Sample Outputs

Pre-generated PDFs are in data/outputs/. V1 outputs used explicit tempo and time signature.

| Piece | Tempo | Time Sig | Notes detected | Output |
| --- | --- | --- | --- | --- |
| Für Elise | 72 BPM | 3/8 | 1747 | fur_elise.pdf |
| Gymnopedie No. 1 | 54 BPM | 3/4 | 841 | gymnopedie.pdf |
| C Major Scale | 120 BPM | 4/4 | 8 | c_major_scale.pdf |
| Für Elise (V2 model) | n/a | n/a | 246 | fur_elise_v2.pdf |
