Code companion for "BMdataset: A Musicologically Curated LilyPond Dataset".
LilyPond is a text-based music engraving language with formal grammar, block structure, and backslash commands — making it structurally similar to a programming language. lilyBERT leverages this by starting from CodeBERT and adapting it to LilyPond through vocabulary extension with 115 domain-specific tokens and masked language model (MLM) pre-training.
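The effect of vocabulary extension can be illustrated with a small self-contained sketch. Note this is not the repo's pipeline: lilyBERT extends CodeBERT's BPE vocabulary via HuggingFace tokenizers, while the tiny vocabulary and greedy matcher below are illustrative stand-ins.

```python
# Illustrative sketch: adding domain-specific tokens to a base vocabulary
# shortens tokenized sequences for LilyPond input. The token sets and the
# greedy longest-match tokenizer are simplified stand-ins, not lilyBERT's
# actual BPE machinery.
BASE_VOCAB = {"\\", "r", "e", "l", "a", "t", "i", "v", "c", "'", " ", "4", "f"}
DOMAIN_TOKENS = {"\\relative", "\\clef", "c'", "f4"}  # hypothetical additions

def tokenize(text, vocab):
    """Greedy longest-match tokenization over a flat vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character: emit as-is
            i += 1
    return tokens

before = tokenize("\\relative c'", BASE_VOCAB)
after = tokenize("\\relative c'", BASE_VOCAB | DOMAIN_TOKENS)
print(len(before), len(after))  # domain tokens compress the sequence
```

With the extended vocabulary, `\relative` and `c'` become single tokens instead of character runs, which is the intuition behind adding the 115 domain-specific tokens before MLM pre-training.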
| Resource | Link |
|---|---|
| Paper | Arxiv (2604.10628) |
| Dataset | Zenodo (doi:10.5281/zenodo.18723290) |
| Model | HuggingFace (csc-unipd/lilybert) |
| Code | GitHub (CSCPadova/lilybert) |
Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold cross-validation; CB = CodeBERT, accuracies in %):
| Model | Composer Acc. | Style Acc. |
|---|---|---|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| CB + PDMX -> BM | 84.3 | 82.9 |
90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy (84.3%).
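The probing setup can be sketched as follows. This is a minimal illustration, assuming frozen layer-6 embeddings have already been extracted (e.g. via the `embed` command); the arrays and the 5-class label set below are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Random placeholders standing in for frozen 768-dim encoder embeddings
# and composer labels; in practice these come from the embedding step.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))      # (num_pieces, hidden_size)
y = rng.integers(0, 5, size=200)     # 5 hypothetical composer classes

# Linear probe: a logistic-regression classifier trained on frozen
# embeddings, evaluated with 5-fold cross-validation.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(f"mean 5-fold accuracy: {scores.mean():.3f}")
```

Because the encoder stays frozen, probe accuracy isolates how much composer/style information the pre-trained representations already carry.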
```shell
# Using uv (recommended)
uv sync
uv pip install -e ".[dev]"

# Using pip
pip install -e .
pip install -e ".[dev]"
```

Download the BMdataset from Zenodo and extract it into `data/raw/`.
```shell
preprocess \
    preprocess.input_dir=data/raw \
    preprocess.output_dir=data/processed \
    preprocess.sharding.enabled=true \
    preprocess.sharding.tokenizer_path=artifacts/tokenizer \
    preprocess.sharding.output_dir=artifacts/pretokenized
```

```shell
train \
    dataset.processed_dir=data/processed \
    dataset.tokenizer_path=artifacts/tokenizer \
    runtime.output_dir=outputs/pretraining
```

```shell
python scripts/generate_layer_plot.py
python scripts/generate_tsne.py
python scripts/generate_confusion_matrix.py
```

| Command | Description |
|---|---|
| `preprocess` | Preprocessing, tokenizer building, and sharding |
| `train` | MLM pre-training and fine-tuning |
| `embed` | Extract frozen-encoder embeddings for downstream tasks |
All commands use Hydra for configuration. Run any command with --help for usage details.
An ONNX version of the model is available on HuggingFace for inference without PyTorch, in any language with an ONNX runtime (Python, C++, C#, Java, JavaScript, Rust, ...).
Install the optional ONNX dependencies:
```shell
pip install -e ".[onnx]"
```

```python
from lilybert.models import LilyBERTOnnxEncoder

encoder = LilyBERTOnnxEncoder("csc-unipd/lilybert")
embeddings = encoder.encode(input_ids, attention_mask)  # (batch, 768) numpy array
```

Download `model.onnx` and `tokenizer.json` from csc-unipd/lilybert on HuggingFace. The ONNX model accepts int64 inputs (`input_ids`, `attention_mask`) and returns a float32 `last_hidden_state` of shape `(batch, seq_len, 768)`. Use index `[:, 0, :]` for CLS embeddings.
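The CLS pooling step is just an array slice. Here is a NumPy-only sketch with a random placeholder standing in for the model's actual output:

```python
import numpy as np

# Placeholder for the ONNX model's float32 last_hidden_state of shape
# (batch, seq_len, hidden) = (batch, seq_len, 768).
batch, seq_len, hidden = 4, 128, 768
last_hidden_state = np.random.rand(batch, seq_len, hidden).astype(np.float32)

# CLS pooling: take the embedding of each sequence's first token.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (4, 768)
```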
```python
from lilybert.data import LilyPondParser, LilyPondPreprocessor, LilyPondTokenizer
from lilybert.models import LilyBERTEncoder
from lilybert.training import TrainingConfig, MLMPretrainer
```

```
lilybert/
├── cli/                      # CLI entry points (preprocess, train, embed)
├── data/                     # Parsing, tokenization, sharding, datasets
│   ├── lexer.py              # MusicalLexer — LilyPond to linear token conversion
│   ├── parser.py             # Syntax validation, pitch normalization
│   ├── tokenizer.py          # Parser-aware BPE tokenizer
│   └── ...
├── models/
│   └── bert_classifier.py    # LilyBERTEncoder (CodeBERT wrapper)
└── training/
    ├── trainer.py            # MLMPretrainer (HuggingFace Trainer-based)
    ├── config.py             # TrainingConfig dataclass
    └── distributed.py        # DDP/FSDP utilities
conf/                         # Hydra configuration
scripts/                      # SLURM scripts and figure generation
notebooks/                    # Linear probing analysis
docs/                         # Design documentation
```
Hydra configuration follows a single-base pattern:
- Base config: `conf/config.yaml`
- Shared groups: `conf/dataset/`, `conf/model/`, `conf/runtime/`, `conf/environment/`
```shell
pytest tests/
pytest tests/ -v --cov=lilybert
pytest -m "not slow"   # skip slow tests
pytest -m "not model"  # skip tests requiring model downloads
```

```bibtex
@misc{spanio2026bmdatasetmusicologicallycuratedlilypond,
  title={BMdataset: A Musicologically Curated LilyPond Dataset},
  author={Matteo Spanio and Ilay Guler and Antonio Rodà},
  year={2026},
  eprint={2604.10628},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.10628},
}
```

Apache-2.0. See `LICENSE`.