HarmonyNet is an end-to-end AI pipeline that converts solo piano audio (MP3/WAV/FLAC) into readable sheet music (PDF). The project is intentionally scoped to solo piano to reduce transcription ambiguity.
V2 is a proof of concept that replaces the rule-based V1 backend with a fine-tuned encoder-decoder Transformer. The model is trained on MAESTRO v3 and outputs MIDI token sequences directly from audio spectrograms.
| Component | Details |
|---|---|
| Encoder | Whisper (tiny/base) pre-trained audio encoder, Mel spectrogram → hidden states |
| Decoder | Custom 4-layer causal Transformer cross-attends to encoder output |
| Input | 80-bin Mel spectrogram, 10s chunks at 16 kHz (Whisper format: [1, 80, 3000]) |
| Output | MIDI token sequence → NoteEvents → MusicXML → PDF |
| Parameters | ~37M encoder (Whisper tiny) + ~10M decoder (trainable in Phase A) |
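The wiring of these components can be sketched in a few lines of PyTorch. The class and attribute names below are illustrative rather than the actual `src/v2/model.py` API, and the hidden size assumes Whisper tiny (d_model = 384):

```python
import torch
import torch.nn as nn
import whisper  # openai-whisper: provides the pre-trained audio encoder


class PianoTranscriptionSketch(nn.Module):
    """Illustrative skeleton: Whisper encoder + 4-layer causal Transformer decoder."""

    def __init__(self, vocab_size: int = 379, d_model: int = 384, n_layers: int = 4):
        super().__init__()
        # Whisper tiny's encoder maps [B, 80, 3000] log-Mel frames to [B, 1500, 384]
        self.encoder = whisper.load_model("tiny").encoder
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(512, d_model)  # up to 512 decoder positions
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: [B, 80, 3000] spectrogram; tokens: [B, T] decoder input ids
        memory = self.encoder(mel)                                     # [B, 1500, 384]
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)           # [B, T, 384]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        x = self.decoder(x, memory, tgt_mask=mask)                     # cross-attention to audio
        return self.lm_head(x)                                         # [B, T, 379] logits
```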
| Token range | Meaning |
|---|---|
| 0 | `<BOS>` - start of sequence |
| 1 | `<EOS>` - end of sequence |
| 2 | `<PAD>` - padding |
| 3–90 | `NOTE_ON` for MIDI pitches 21–108 (piano range) |
| 91–178 | `NOTE_OFF` for MIDI pitches 21–108 |
| 179–378 | `TIME_SHIFT` - 200 steps × 50 ms = up to 10 s of timing |
Total vocabulary size: 379 tokens.
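A minimal sketch of how note events map onto this vocabulary (helper names are illustrative; the real implementation lives in `src/v2/tokenizer.py`):

```python
# Token id layout from the table above; helper names are illustrative.
BOS, EOS, PAD = 0, 1, 2
NOTE_ON_BASE, NOTE_OFF_BASE, TIME_SHIFT_BASE = 3, 91, 179
MIN_PITCH = 21                           # lowest piano key (A0)
TIME_STEP_MS, NUM_TIME_STEPS = 50, 200   # 200 steps x 50 ms = 10 s per chunk


def note_on(pitch: int) -> int:
    return NOTE_ON_BASE + (pitch - MIN_PITCH)    # MIDI 21-108 -> tokens 3-90


def note_off(pitch: int) -> int:
    return NOTE_OFF_BASE + (pitch - MIN_PITCH)   # MIDI 21-108 -> tokens 91-178


def time_shift(ms: float) -> int:
    steps = min(int(round(ms / TIME_STEP_MS)), NUM_TIME_STEPS - 1)
    return TIME_SHIFT_BASE + steps               # tokens 179-378


# Example: middle C (MIDI 60) held for 500 ms at the start of a chunk
sequence = [BOS, note_on(60), time_shift(500), note_off(60), EOS]
```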
Training uses teacher forcing (ground-truth previous tokens fed as decoder input) across two phases:
Phase A - Decoder only (encoder frozen)
- Encoder weights locked; only decoder (~10M params) trained
- Learning rate: 1e-4 for 10 epochs
- Rationale: Whisper's pre-trained representations are already high quality. Training the decoder first gives it time to learn the token vocabulary before disturbing the encoder.
Phase B - Joint fine-tuning
- Encoder unfrozen; trained jointly with lower LR (1e-5 encoder, 1e-4 decoder)
- 5 additional epochs
- Rationale: After the decoder converges, subtle encoder adaptation to piano-specific spectral patterns improves note boundary precision.
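In code, the two phases amount to toggling `requires_grad` on the encoder and giving each module its own learning rate. A sketch, assuming the model exposes `encoder` and `decoder` attributes (the actual trainer is in `src/v2/train.py`):

```python
import torch


def build_optimizer(model, phase: str) -> torch.optim.Optimizer:
    """Sketch of the two-phase setup; assumes model has .encoder and .decoder."""
    if phase == "A":
        # Phase A: freeze the Whisper encoder, train only the decoder at 1e-4
        for p in model.encoder.parameters():
            p.requires_grad = False
        return torch.optim.AdamW(
            [p for p in model.parameters() if p.requires_grad], lr=1e-4)
    # Phase B: unfreeze the encoder; joint fine-tuning with per-group learning rates
    for p in model.encoder.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.encoder.parameters(), "lr": 1e-5},  # gentle encoder adaptation
        {"params": model.decoder.parameters(), "lr": 1e-4},  # decoder keeps the higher LR
    ])
```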
- Dataset: MAESTRO v3.0.0 (Yamaha Disklavier recordings with aligned MIDI ground truth)
- Subset used: 50 pieces (~8 hours of audio)
- Segments: 10-second overlapping windows; spectrogram [1, 80, 3000], token sequence up to 512 tokens
- Loss: Cross-entropy on token predictions (teacher forcing)
- Hardware: Apple M-series (MPS backend), batch size 8
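A single teacher-forcing step then looks roughly like this (a sketch; `pad_id=2` follows the vocabulary table above):

```python
import torch.nn.functional as F


def training_step(model, mel, tokens, pad_id: int = 2):
    # Decoder sees the ground-truth sequence shifted right and must predict the next token.
    decoder_input = tokens[:, :-1]            # <BOS> t1 t2 ...
    targets = tokens[:, 1:]                   # t1 t2 ... <EOS>/<PAD>
    logits = model(mel, decoder_input)        # [B, T-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),
        ignore_index=pad_id,                  # padded positions contribute no loss
    )
```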
| Phase | Epochs | Train Loss | Val Loss |
|---|---|---|---|
| A (frozen encoder) | 10 | ~2.4 | ~2.7 |
| B (joint) | 5 | ~2.3 | ~2.71 |
Note-level Precision / Recall / F1 using the mir_eval standard:
- A predicted note is a True Positive if it matches a reference note on the same pitch with onset within 50ms
- Each reference note can only be matched once (greedy matching by onset proximity)
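The same criterion can be computed with `mir_eval.transcription` (a sketch; mir_eval expects pitches in Hz and performs one-to-one matching, while `src/v2/evaluate.py` may use the greedy variant described above):

```python
import numpy as np
import mir_eval


def note_f1(ref_notes, est_notes, onset_tol: float = 0.05):
    """ref_notes / est_notes: lists of (onset_s, offset_s, midi_pitch) tuples."""
    def split(notes):
        intervals = np.array([[on, off] for on, off, _ in notes])
        pitches = np.array([440.0 * 2 ** ((p - 69) / 12) for _, _, p in notes])  # MIDI -> Hz
        return intervals, pitches

    ref_i, ref_p = split(ref_notes)
    est_i, est_p = split(est_notes)
    # offset_ratio=None: match on pitch and onset only (within 50 ms),
    # with each reference note matched at most once.
    precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_i, ref_p, est_i, est_p, onset_tolerance=onset_tol, offset_ratio=None)
    return precision, recall, f1
```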
```bash
python -m src.v2.evaluate --checkpoint models/v2/best_model.pt --max-segments 20 --split validation
```

POC caveat: The model was trained on a 50-piece subset. F1 scores reflect early-stage training and will improve with more data and training compute. The architecture is sound; scaling data and compute is the primary path to production quality.

```bash
# Transcribe with V2 model
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 -o output_v2.pdf

# With a specific checkpoint
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --checkpoint models/v2/best_model.pt

# MusicXML only (no PDF)
python -m src.cli transcribe data/inputs/fur_elise.mp3 --model v2 --no-pdf
```

```
src/v2/
├── model.py         # PianoTranscriptionModel (Whisper encoder + causal decoder)
├── tokenizer.py     # Token vocabulary, encode_notes(), decode_tokens()
├── spectrogram.py   # WhisperSpectrogramExtractor → [1, 80, 3000]
├── dataset.py       # MAESTRODataset, segment pipeline, teacher-forcing batches
├── train.py         # Two-phase SFT Trainer (Phase A + Phase B)
├── transcribe.py    # Inference bridge: audio → chunks → NoteEvents → TranscriptionResult
└── evaluate.py      # Note-level P/R/F1 evaluation against MAESTRO MIDI ground truth

models/v2/
└── best_model.pt    # Best checkpoint (val_loss=2.71, 50 pieces, 15 epochs)
```
- Small training set: 50 pieces is far below the full MAESTRO dataset (~1,200 pieces). F1 will improve with scale.
- O(n²) inference: No KV cache — `nn.TransformerDecoder` re-runs full self-attention over all past tokens at O(n²). Inference is slow for long sequences (see the sketch after this list).
- Single clef output: No grand staff splitting (treble only).
- Token budget: `max_gen_tokens=128` per 10s chunk trades recall for speed (dense passages may be truncated).
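The KV-cache limitation is visible in the shape of the decoding loop: every step re-runs the decoder over the whole prefix. A sketch of greedy decoding (model API as in the architecture sketch above; `transcribe.py` may differ in detail):

```python
import torch


@torch.no_grad()
def greedy_decode(model, mel, max_gen_tokens: int = 128, bos: int = 0, eos: int = 1):
    tokens = torch.tensor([[bos]], dtype=torch.long)
    for _ in range(max_gen_tokens):
        logits = model(mel, tokens)                   # full prefix re-processed each step: O(n^2)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos:                  # stop at <EOS>
            break
    return tokens[0, 1:]                              # drop <BOS>
```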
V1 builds a working end-to-end pipeline using basic-pitch (Spotify's ICASSP 2022 model) as the transcription backend. No custom ML training required.
Audio → basic-pitch CNN → NoteEvents → Quantizer → MusicXML → PDF
- Accurate pitch detection across the full 88-key piano range (MIDI 21–108)
- Correct onset timing and note durations
- Configurable tempo, time signature, and detection thresholds
- Known gaps: rest detection is imperfect, and pedal/sustain is not modeled
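Under the hood, the V1 backend boils down to a single call into basic-pitch; a sketch (the project wraps this in `src/inference.py`, so argument handling may differ):

```python
from basic_pitch import ICASSP_2022_MODEL_PATH
from basic_pitch.inference import predict

# Run the ICASSP 2022 model on one file; thresholds mirror the CLI defaults below.
model_output, midi_data, note_events = predict(
    "data/inputs/Gymnopedie.mp3",
    ICASSP_2022_MODEL_PATH,
    onset_threshold=0.5,
    frame_threshold=0.3,
)
# note_events is a list of (start_s, end_s, midi_pitch, amplitude, pitch_bends)
# tuples that the quantizer snaps to a beat grid before MusicXML export.
```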
```bash
# Generate sheet music PDF
python -m src.cli transcribe input.mp3 -o output.pdf

# With custom tempo and time signature
python -m src.cli transcribe data/inputs/Gymnopedie.mp3 -o data/outputs/gymnopedie.pdf --tempo 54 --time-sig 3/4

# MusicXML only
python -m src.cli transcribe input.mp3 --no-pdf
```

| Option | Default | Description |
|---|---|---|
| `--tempo` | 120 | Tempo in BPM |
| `--time-sig` | 4/4 | Time signature |
| `--onset-threshold` | 0.5 | Onset detection sensitivity (0–1) |
| `--frame-threshold` | 0.3 | Note frame sensitivity (0–1) |
| `--title` | filename | Score title |
| `--no-pdf` | false | Output MusicXML only |
| `--keep-musicxml` | false | Keep MusicXML alongside PDF |
- ONNX Runtime is used instead of TensorFlow for Python 3.12 compatibility. See `docs/inference_guide.md`.
- basic-pitch (ICASSP 2022 model) provides the CNN that produces onset, note, and contour predictions from Harmonic CQT spectrograms.
- music21 handles MusicXML encoding. MuseScore handles PDF rendering.
- A scipy compatibility shim patches `scipy.signal.gaussian` for scipy 1.14+ (see `src/inference.py` and the sketch below).
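The shim itself is tiny; a minimal version (the project's copy in `src/inference.py` may differ):

```python
import scipy.signal
import scipy.signal.windows

# Newer scipy releases removed scipy.signal.gaussian in favour of
# scipy.signal.windows.gaussian; restore the old name if it is missing.
if not hasattr(scipy.signal, "gaussian"):
    scipy.signal.gaussian = scipy.signal.windows.gaussian
```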
The V1 pipeline is exposed as a REST API built on FastAPI + Celery + Redis (Upstash), with file storage on Cloudflare R2 and optional OpenAI musical analysis.
```
Client → FastAPI → Celery task queue → Worker
             ↑                            ↓
      Redis (Upstash)             V1 pipeline → R2 storage
                                          ↓
                            OpenAI GPT-4o-mini (optional)
```
| Component | Role |
|---|---|
| FastAPI | Accepts audio uploads, dispatches jobs, serves status/download endpoints |
| Celery | Runs transcription in the background so the HTTP request returns immediately |
| Redis (Upstash) | Broker + result store for Celery, supports rediss:// SSL |
| Cloudflare R2 | Stores completed PDF and MusicXML outputs, served via presigned URLs |
| OpenAI GPT-4o-mini | Optional — identifies the piece, estimates difficulty, gives a practice tip |
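A sketch of how the upload-to-job handoff fits together (module paths, task names, and response fields here are assumptions; see `api/main.py` and `api/celery_app.py` for the real code):

```python
from uuid import uuid4

from celery import Celery
from fastapi import FastAPI, UploadFile

app = FastAPI()
celery = Celery("harmonynet", broker="rediss://...", backend="rediss://...")


@celery.task(name="transcribe_audio")
def transcribe_audio(job_id: str, audio_path: str) -> None:
    """Run the V1 pipeline and upload PDF/MusicXML to R2 (omitted in this sketch)."""


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    job_id = str(uuid4())
    path = f"/tmp/{job_id}_{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())          # persist the upload for the worker
    transcribe_audio.delay(job_id, path)    # enqueue; the HTTP request returns immediately
    return {"job_id": job_id}
```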
Endpoints:
| Method | Path | Description |
|---|---|---|
| `POST` | `/transcribe` | Upload audio, returns `job_id` |
| `GET` | `/status/{job_id}` | Poll job state (pending → processing → done) |
| `GET` | `/result/{job_id}/pdf` | Redirect to presigned R2 PDF URL |
| `GET` | `/result/{job_id}/musicxml` | Redirect to presigned R2 MusicXML URL |
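A typical client round trip against these endpoints might look like this (the base URL and the exact status field names are assumptions):

```python
import time

import requests

BASE = "http://localhost:8000"

# Upload the audio file and get a job id back
with open("data/inputs/fur_elise.mp3", "rb") as f:
    job_id = requests.post(f"{BASE}/transcribe", files={"file": f}).json()["job_id"]

# Poll until the worker reports the job as done
while requests.get(f"{BASE}/status/{job_id}").json().get("status") != "done":
    time.sleep(2)

# Follow the redirect to the presigned R2 URL and save the PDF locally
pdf = requests.get(f"{BASE}/result/{job_id}/pdf", allow_redirects=True)
with open("output.pdf", "wb") as out:
    out.write(pdf.content)
```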
Running locally:
```bash
# Terminal 1 — API server
uvicorn api.main:app --reload

# Terminal 2 — Celery worker
celery -A api.celery_app.celery worker --loglevel=info
```

Environment variables:
| Variable | Required | Description |
|---|---|---|
| `UPSTASH_REDIS_URL` | Yes (or `REDIS_URL`) | Redis broker/backend (`rediss://` for Upstash SSL) |
| `REDIS_URL` | Fallback | Plain Redis URL for local dev |
| `R2_ACCOUNT_ID` | Yes | Cloudflare R2 account |
| `R2_ACCESS_KEY_ID` | Yes | R2 credentials |
| `R2_SECRET_ACCESS_KEY` | Yes | R2 credentials |
| `R2_BUCKET` | Yes | R2 bucket name |
| `OPENAI_API_KEY` | No | Enables GPT-4o-mini analysis; pipeline works without it |
| `FRONTEND_URL` | No | CORS origin (default: `http://localhost:3000`) |
Python 3.12+ required.
```bash
git clone https://github.com/NishevithaV/HarmonyNet.git
cd HarmonyNet
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Check all dependencies:

```bash
python -m src.cli check
```

Install MuseScore 4 from: https://musescore.org/en/download
- macOS: `/Applications/MuseScore 4.app/Contents/MacOS/mscore`
- Linux: `/usr/bin/mscore` or `/usr/local/bin/mscore4`
- Windows: `C:\Program Files\MuseScore 4\bin\MuseScore4.exe`
If MuseScore is not installed, the pipeline still produces MusicXML output that can be opened in any notation software.
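If the pipeline renders PDFs through music21's MuseScore integration, pointing music21 at a non-default binary looks roughly like this (a sketch; the project may instead invoke MuseScore directly, and the path shown is the macOS one from the list above):

```python
from music21 import environment

env = environment.Environment()
# Tell music21 where the MuseScore 4 binary lives (adjust per platform)
env["musicxmlPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
env["musescoreDirectPNGPath"] = "/Applications/MuseScore 4.app/Contents/MacOS/mscore"
```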
V2 requires a trained checkpoint at models/v2/best_model.pt. It will auto-download from HuggingFace the first time you run --model v2.
Checkpoint hosted at: https://huggingface.co/nishevithav/harmonynet-v2
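The auto-download can be reproduced manually with `huggingface_hub` (a sketch; the filename inside the repo is an assumption):

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="nishevithav/harmonynet-v2",   # repo from the link above
    filename="best_model.pt",              # assumed filename
    local_dir="models/v2",
)
print(checkpoint_path)                     # e.g. models/v2/best_model.pt
```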
To train from scratch (requires MAESTRO v3 audio data):
```bash
python -m src.v2.train
```

Training on 50 pieces takes ~2–3 hours on an Apple M-series chip.
Pre-generated PDFs are in data/outputs/. V1 outputs used explicit tempo and time signature.
| Piece | Tempo | Time Sig | Notes detected | Output |
|---|---|---|---|---|
| Für Elise | 72 BPM | 3/8 | 1747 | fur_elise.pdf |
| Gymnopedie No. 1 | 54 BPM | 3/4 | 841 | gymnopedie.pdf |
| C Major Scale | 120 BPM | 4/4 | 8 | c_major_scale.pdf |
| Für Elise (V2 model) | — | — | 246 | fur_elise_v2.pdf |
