HarmonyNet is an end-to-end AI pipeline that converts solo piano audio (MP3/WAV/FLAC) into readable sheet music (PDF). The project is intentionally scoped to solo piano to reduce transcription ambiguity.
V2 is a proof of concept that replaces the rule-based V1 backend with a fine-tuned encoder-decoder Transformer. The model is trained on MAESTRO v3 and outputs MIDI token sequences directly from audio spectrograms.
| Component | Details |
|---|---|
| Encoder | Whisper (tiny/base) pre-trained audio encoder, Mel spectrogram → hidden states |
| Decoder | Custom 4-layer causal Transformer cross-attends to encoder output |
| Input | 80-bin Mel spectrogram, 10s chunks at 16 kHz (Whisper format: [1, 80, 3000]) |
| Output | MIDI token sequence → NoteEvents → MusicXML → PDF |
| Parameters | ~37M encoder (Whisper tiny) + ~10M decoder (trainable in Phase A) |
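The wiring of these components can be sketched in a few lines of PyTorch. The class and attribute names below are illustrative rather than the actual `src/v2/model.py` API, and the hidden size assumes Whisper tiny (d_model = 384):

```python
import torch
import torch.nn as nn
import whisper  # openai-whisper: provides the pre-trained audio encoder


class PianoTranscriptionSketch(nn.Module):
    """Illustrative skeleton: Whisper encoder + 4-layer causal Transformer decoder."""

    def __init__(self, vocab_size: int = 379, d_model: int = 384, n_layers: int = 4):
        super().__init__()
        # Whisper tiny's encoder maps [B, 80, 3000] log-Mel frames to [B, 1500, 384]
        self.encoder = whisper.load_model("tiny").encoder
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(512, d_model)  # up to 512 decoder positions
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: [B, 80, 3000] spectrogram; tokens: [B, T] decoder input ids
        memory = self.encoder(mel)                                     # [B, 1500, 384]
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)           # [B, T, 384]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        x = self.decoder(x, memory, tgt_mask=mask)                     # cross-attention to audio
        return self.lm_head(x)                                         # [B, T, 379] logits
```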
| Token range | Meaning |
|---|---|
| 0 | `<BOS>` - start of sequence |
| 1 | `<EOS>` - end of sequence |
| 2 | `<PAD>` - padding |
| 3–90 | `NOTE_ON` for MIDI pitches 21–108 (piano range) |
| 91–178 | `NOTE_OFF` for MIDI pitches 21–108 |
| 179–378 | `TIME_SHIFT` - 200 steps × 50 ms = up to 10 s of timing |
Total vocabulary size: 379 tokens.
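A minimal sketch of how note events map onto this vocabulary (helper names are illustrative; the real implementation lives in `src/v2/tokenizer.py`):

```python
# Token id layout from the table above; helper names are illustrative.
BOS, EOS, PAD = 0, 1, 2
NOTE_ON_BASE, NOTE_OFF_BASE, TIME_SHIFT_BASE = 3, 91, 179
MIN_PITCH = 21                           # lowest piano key (A0)
TIME_STEP_MS, NUM_TIME_STEPS = 50, 200   # 200 steps x 50 ms = 10 s per chunk


def note_on(pitch: int) -> int:
    return NOTE_ON_BASE + (pitch - MIN_PITCH)    # MIDI 21-108 -> tokens 3-90


def note_off(pitch: int) -> int:
    return NOTE_OFF_BASE + (pitch - MIN_PITCH)   # MIDI 21-108 -> tokens 91-178


def time_shift(ms: float) -> int:
    steps = min(int(round(ms / TIME_STEP_MS)), NUM_TIME_STEPS - 1)
    return TIME_SHIFT_BASE + steps               # tokens 179-378


# Example: middle C (MIDI 60) held for 500 ms at the start of a chunk
sequence = [BOS, note_on(60), time_shift(500), note_off(60), EOS]
```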
Training uses teacher forcing (ground-truth previous tokens fed as decoder input) across two phases:
Phase A - Decoder only (encoder frozen)
- Encoder weights locked; only decoder (~10M params) trained
- Learning rate: 1e-4 for 10 epochs
- Rationale: Whisper's pre-trained representations are already high quality. Training the decoder first gives it time to learn the token vocabulary before disturbing the encoder.
Phase B - Joint fine-tuning
- Encoder unfrozen; trained jointly with lower LR (1e-5 encoder, 1e-4 decoder)
- 5 additional epochs
- Rationale: After the decoder converges, subtle encoder adaptation to piano-specific spectral patterns improves note boundary precision.
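In code, the two phases amount to toggling `requires_grad` on the encoder and giving each module its own learning rate. A sketch, assuming the model exposes `encoder` and `decoder` attributes (the actual trainer is in `src/v2/train.py`):

```python
import torch


def build_optimizer(model, phase: str) -> torch.optim.Optimizer:
    """Sketch of the two-phase setup; assumes model has .encoder and .decoder."""
    if phase == "A":
        # Phase A: freeze the Whisper encoder, train only the decoder at 1e-4
        for p in model.encoder.parameters():
            p.requires_grad = False
        return torch.optim.AdamW(
            [p for p in model.parameters() if p.requires_grad], lr=1e-4)
    # Phase B: unfreeze the encoder; joint fine-tuning with per-group learning rates
    for p in model.encoder.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.encoder.parameters(), "lr": 1e-5},  # gentle encoder adaptation
        {"params": model.decoder.parameters(), "lr": 1e-4},  # decoder keeps the higher LR
    ])
```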
- Dataset: MAESTRO v3.0.0 (Yamaha Disklavier recordings with aligned MIDI ground truth)
- Subset used: 50 pieces (~8 hours of audio)
- Segments: 10-second overlapping windows; spectrogram [1, 80, 3000], token sequence up to 512 tokens
- Loss: Cross-entropy on token predictions (teacher forcing)
- Hardware: Apple M-series (MPS backend), batch size 8
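A single teacher-forcing step then looks roughly like this (a sketch; `pad_id=2` follows the vocabulary table above):

```python
import torch.nn.functional as F


def training_step(model, mel, tokens, pad_id: int = 2):
    # Decoder sees the ground-truth sequence shifted right and must predict the next token.
    decoder_input = tokens[:, :-1]            # <BOS> t1 t2 ...
    targets = tokens[:, 1:]                   # t1 t2 ... <EOS>/<PAD>
    logits = model(mel, decoder_input)        # [B, T-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),
        ignore_index=pad_id,                  # padded positions contribute no loss
    )
```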
| Phase | Epochs | Train Loss | Val Loss |
|---|---|---|---|
| A (frozen encoder) | 10 | ~2.4 | ~2.7 |
| B (joint) | 5 | ~2.3 | ~2.71 |
Note-level Precision / Recall / F1 using the mir_eval standard:
- A predicted note is a True Positive if it matches a reference note on the same pitch with onset within 50ms
- Each reference note can only be matched once (greedy matching by onset proximity)
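The same criterion can be computed with `mir_eval.transcription` (a sketch; mir_eval expects pitches in Hz and performs one-to-one matching, while `src/v2/evaluate.py` may use the greedy variant described above):

```python
import numpy as np
import mir_eval


def note_f1(ref_notes, est_notes, onset_tol: float = 0.05):
    """ref_notes / est_notes: lists of (onset_s, offset_s, midi_pitch) tuples."""
    def split(notes):
        intervals = np.array([[on, off] for on, off, _ in notes])
        pitches = np.array([440.0 * 2 ** ((p - 69) / 12) for _, _, p in notes])  # MIDI -> Hz
        return intervals, pitches

    ref_i, ref_p = split(ref_notes)
    est_i, est_p = split(est_notes)
    # offset_ratio=None: match on pitch and onset only (within 50 ms),
    # with each reference note matched at most once.
    precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_i, ref_p, est_i, est_p, onset_tolerance=onset_tol, offset_ratio=None)
    return precision, recall, f1
```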
```bash
python -m src.v2.evaluate --checkpoint models/v2/best_model.pt --max-segments 20 --split validation
```

POC caveat: The model was trained on a 50-piece subset. F1 scores reflect early-stage training and will improve with more data and training compute. The architecture is sound; scaling data and compute is the primary path to production quality.

```bash
# Transcribe with V2 model
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 -o output_v2.pdf

# With a specific checkpoint
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --checkpoint models/v2/best_model.pt

# MusicXML only (no PDF)
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --no-pdf
```

```
src/v2/
├── model.py         # PianoTranscriptionModel (Whisper encoder + causal decoder)
├── tokenizer.py     # Token vocabulary, encode_notes(), decode_tokens()
├── spectrogram.py   # WhisperSpectrogramExtractor → [1, 80, 3000]
├── dataset.py       # MAESTRODataset, segment pipeline, teacher-forcing batches
├── train.py         # Two-phase SFT Trainer (Phase A + Phase B)
├── transcribe.py    # Inference bridge: audio → chunks → NoteEvents → TranscriptionResult
└── evaluate.py      # Note-level P/R/F1 evaluation against MAESTRO MIDI ground truth

models/v2/
└── best_model.pt    # Best checkpoint (val_loss=2.71, 50 pieces, 15 epochs)
```
- Small training set: 50 pieces is far below the full MAESTRO dataset (~1,200 pieces). F1 will improve with scale.
- O(n²) inference: No KV cache — `nn.TransformerDecoder` re-runs full self-attention over all past tokens at O(n²). Inference is slow for long sequences (see the sketch after this list).
- Single clef output: No grand staff splitting (treble only).
- Token budget: `max_gen_tokens=128` per 10s chunk trades recall for speed (dense passages may be truncated).
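The KV-cache limitation is visible in the shape of the decoding loop: every step re-runs the decoder over the whole prefix. A sketch of greedy decoding (model API as in the architecture sketch above; `transcribe.py` may differ in detail):

```python
import torch


@torch.no_grad()
def greedy_decode(model, mel, max_gen_tokens: int = 128, bos: int = 0, eos: int = 1):
    tokens = torch.tensor([[bos]], dtype=torch.long)
    for _ in range(max_gen_tokens):
        logits = model(mel, tokens)                   # full prefix re-processed each step: O(n^2)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos:                  # stop at <EOS>
            break
    return tokens[0, 1:]                              # drop <BOS>
```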
V1 builds a working end-to-end pipeline using basic-pitch (Spotify's ICASSP 2022 model) as the transcription backend. No custom ML training required.
Audio → basic-pitch CNN → NoteEvents → Quantizer → MusicXML → PDF
- Accurate pitch detection across the full 88-key piano range (MIDI 21–108)
- Correct onset timing and note durations
- Configurable tempo, time signature, and detection thresholds
- Known gaps: rest detection is imperfect, and pedal/sustain is not modeled
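Under the hood, the V1 backend boils down to a single call into basic-pitch; a sketch (the project wraps this in `src/inference.py`, so argument handling may differ):

```python
from basic_pitch import ICASSP_2022_MODEL_PATH
from basic_pitch.inference import predict

# Run the ICASSP 2022 model on one file; thresholds mirror the CLI defaults below.
model_output, midi_data, note_events = predict(
    "data/inputs/Gymnopedie.mp3",
    ICASSP_2022_MODEL_PATH,
    onset_threshold=0.5,
    frame_threshold=0.3,
)
# note_events is a list of (start_s, end_s, midi_pitch, amplitude, pitch_bends)
# tuples that the quantizer snaps to a beat grid before MusicXML export.
```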
```bash
# Generate sheet music PDF
python -m src.cli transcribe input.mp3 -o output.pdf

# With custom tempo and time signature
python -m src.cli transcribe data/inputs/Gymnopedie.mp3 -o data/outputs/gymnopedie.pdf --tempo 54 --time-sig 3/4

# MusicXML only
python -m src.cli transcribe input.mp3 --no-pdf
```

| Option | Default | Description |
|---|---|---|
| `--tempo` | 120 | Tempo in BPM |
| `--time-sig` | 4/4 | Time signature |
| `--onset-threshold` | 0.5 | Onset detection sensitivity (0–1) |
| `--frame-threshold` | 0.3 | Note frame sensitivity (0–1) |
| `--title` | filename | Score title |
| `--no-pdf` | false | Output MusicXML only |
| `--keep-musicxml` | false | Keep MusicXML alongside PDF |
- ONNX Runtime is used instead of TensorFlow for Python 3.12 compatibility. See `docs/inference_guide.md`.
- basic-pitch (ICASSP 2022 model) provides the CNN that produces onset, note, and contour predictions from Harmonic CQT spectrograms.
- music21 handles MusicXML encoding. MuseScore handles PDF rendering.
- A scipy compatibility shim patches `scipy.signal.gaussian` for scipy 1.14+ (see `src/inference.py` and the sketch below).
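The shim itself is tiny; a minimal version (the project's copy in `src/inference.py` may differ):

```python
import scipy.signal
import scipy.signal.windows

# Newer scipy releases removed scipy.signal.gaussian in favour of
# scipy.signal.windows.gaussian; restore the old name if it is missing.
if not hasattr(scipy.signal, "gaussian"):
    scipy.signal.gaussian = scipy.signal.windows.gaussian
```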
The V1 pipeline is exposed as a REST API built on FastAPI + Celery + Redis (Upstash), with file storage on Cloudflare R2 and optional OpenAI musical analysis.
```
Client → FastAPI → Celery task queue → Worker
             ↑                            ↓
      Redis (Upstash)             V1 pipeline → R2 storage
                                          ↓
                            OpenAI GPT-4o-mini (optional)
```
| Component | Role |
|---|---|
| FastAPI | Accepts audio uploads, dispatches jobs, serves status/download endpoints |
| Celery | Runs transcription in the background so the HTTP request returns immediately |
| Redis (Upstash) | Broker + result store for Celery, supports rediss:// SSL |
| Cloudflare R2 | Stores completed PDF and MusicXML outputs, served via presigned URLs |
| OpenAI GPT-4o-mini | Optional — identifies the piece, estimates difficulty, gives a practice tip |
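A sketch of how the upload-to-job handoff fits together (module paths, task names, and response fields here are assumptions; see `api/main.py` and `api/celery_app.py` for the real code):

```python
from uuid import uuid4

from celery import Celery
from fastapi import FastAPI, UploadFile

app = FastAPI()
celery = Celery("harmonynet", broker="rediss://...", backend="rediss://...")


@celery.task(name="transcribe_audio")
def transcribe_audio(job_id: str, audio_path: str) -> None:
    """Run the V1 pipeline and upload PDF/MusicXML to R2 (omitted in this sketch)."""


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    job_id = str(uuid4())
    path = f"/tmp/{job_id}_{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())          # persist the upload for the worker
    transcribe_audio.delay(job_id, path)    # enqueue; the HTTP request returns immediately
    return {"job_id": job_id}
```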
Endpoints:
| Method | Path | Description |
|---|---|---|
| `POST` | `/transcribe` | Upload audio, returns `job_id` |
| `GET` | `/status/{job_id}` | Poll job state (pending → processing → done) |
| `GET` | `/result/{job_id}/pdf` | Redirect to presigned R2 PDF URL |
| `GET` | `/result/{job_id}/musicxml` | Redirect to presigned R2 MusicXML URL |
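A typical client round trip against these endpoints might look like this (the base URL and the exact status field names are assumptions):

```python
import time

import requests

BASE = "http://localhost:8000"

# Upload the audio file and get a job id back
with open("data/inputs/fur_elise.mp3", "rb") as f:
    job_id = requests.post(f"{BASE}/transcribe", files={"file": f}).json()["job_id"]

# Poll until the worker reports the job as done
while requests.get(f"{BASE}/status/{job_id}").json().get("status") != "done":
    time.sleep(2)

# Follow the redirect to the presigned R2 URL and save the PDF locally
pdf = requests.get(f"{BASE}/result/{job_id}/pdf", allow_redirects=True)
with open("output.pdf", "wb") as out:
    out.write(pdf.content)
```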
Running locally:
```bash
# Terminal 1 — API server
uvicorn api.main:app --reload

# Terminal 2 — Celery worker
celery -A api.celery_app.celery worker --loglevel=info
```

Environment variables:
| Variable | Required | Description |
|---|---|---|
| `UPSTASH_REDIS_URL` | Yes (or `REDIS_URL`) | Redis broker/backend (`rediss://` for Upstash SSL) |
| `REDIS_URL` | Fallback | Plain Redis URL for local dev |
| `R2_ACCOUNT_ID` | Yes | Cloudflare R2 account |
| `R2_ACCESS_KEY_ID` | Yes | R2 credentials |
| `R2_SECRET_ACCESS_KEY` | Yes | R2 credentials |
| `R2_BUCKET` | Yes | R2 bucket name |
| `OPENAI_API_KEY` | No | Enables GPT-4o-mini analysis; pipeline works without it |
| `FRONTEND_URL` | No | CORS origin (default: `http://localhost:3000`) |
Python 3.12+ required.
```bash
git clone https://github.com/NishevithaV/HarmonyNet.git
cd HarmonyNet
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Check all dependencies:

```bash
python -m src.cli check
```

Install MuseScore 4 from: https://musescore.org/en/download
- macOS: `/Applications/MuseScore 4.app/Contents/MacOS/mscore`
- Linux: `/usr/bin/mscore` or `/usr/local/bin/mscore4`
- Windows: `C:\Program Files\MuseScore 4\bin\MuseScore4.exe`
If MuseScore is not installed, the pipeline still produces MusicXML output that can be opened in any notation software.
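If the pipeline renders PDFs through music21's MuseScore integration, pointing music21 at a non-default binary looks roughly like this (a sketch; the project may instead invoke MuseScore directly, and the path shown is the macOS one from the list above):

```python
from music21 import environment

env = environment.Environment()
# Tell music21 where the MuseScore 4 binary lives (adjust per platform)
env["musicxmlPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
env["musescoreDirectPNGPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
```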
V2 requires a trained checkpoint at models/v2/best_model.pt. It will auto-download from HuggingFace the first time you run --model v2.
Checkpoint hosted at: https://huggingface.co/nishevithav/harmonynet-v2
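The auto-download can be reproduced manually with `huggingface_hub` (a sketch; the filename inside the repo is an assumption):

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="nishevithav/harmonynet-v2",   # repo from the link above
    filename="best_model.pt",              # assumed filename
    local_dir="models/v2",
)
print(checkpoint_path)                     # e.g. models/v2/best_model.pt
```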
To train from scratch (requires MAESTRO v3 audio data):
```bash
python -m src.v2.train
```

Training on 50 pieces takes ~2–3 hours on an Apple M-series chip.
Pre-generated PDFs are in data/outputs/. V1 outputs used explicit tempo and time signature.
| Piece | Tempo | Time Sig | Notes detected | Output |
|---|---|---|---|---|
| Für Elise | 72 BPM | 3/8 | 1747 | fur_elise.pdf |
| Gymnopedie No. 1 | 54 BPM | 3/4 | 841 | gymnopedie.pdf |
| C Major Scale | 120 BPM | 4/4 | 8 | c_major_scale.pdf |
| Für Elise (V2 model) | — | — | 246 | fur_elise_v2.pdf |
