Code companion for "BMdataset: A Musicologically Curated LilyPond Dataset".
LilyPond is a text-based music engraving language with formal grammar, block structure, and backslash commands — making it structurally similar to a programming language. lilyBERT leverages this by starting from CodeBERT and adapting it to LilyPond through vocabulary extension with 115 domain-specific tokens and masked language model (MLM) pre-training.
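The effect of vocabulary extension can be illustrated with a small self-contained sketch. Note this is not the repo's pipeline: lilyBERT extends CodeBERT's BPE vocabulary via HuggingFace tokenizers, while the tiny vocabulary and greedy matcher below are illustrative stand-ins.

```python
# Illustrative sketch: adding domain-specific tokens to a base vocabulary
# shortens tokenized sequences for LilyPond input. The token sets and the
# greedy longest-match tokenizer are simplified stand-ins, not lilyBERT's
# actual BPE machinery.
BASE_VOCAB = {"\\", "r", "e", "l", "a", "t", "i", "v", "c", "'", " ", "4", "f"}
DOMAIN_TOKENS = {"\\relative", "\\clef", "c'", "f4"}  # hypothetical additions

def tokenize(text, vocab):
    """Greedy longest-match tokenization over a flat vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character: emit as-is
            i += 1
    return tokens

before = tokenize("\\relative c'", BASE_VOCAB)
after = tokenize("\\relative c'", BASE_VOCAB | DOMAIN_TOKENS)
print(len(before), len(after))  # domain tokens compress the sequence
```

With the extended vocabulary, `\relative` and `c'` become single tokens instead of character runs, which is the intuition behind adding the 115 domain-specific tokens before MLM pre-training.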
| Resource | Link |
|---|---|
| Paper | Arxiv (2604.10628) |
| Dataset | Zenodo (doi:10.5281/zenodo.18723290) |
| Model | HuggingFace (csc-unipd/lilybert) |
| Code | GitHub (CSCPadova/lilybert) |
Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold cross-validation; CB = CodeBERT, accuracies in %):
| Model | Composer Acc. | Style Acc. |
|---|---|---|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| CB + PDMX -> BM | 84.3 | 82.9 |
90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy (84.3%).
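The probing setup can be sketched as follows. This is a minimal illustration, assuming frozen layer-6 embeddings have already been extracted (e.g. via the `embed` command); the arrays and the 5-class label set below are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Random placeholders standing in for frozen 768-dim encoder embeddings
# and composer labels; in practice these come from the embedding step.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))      # (num_pieces, hidden_size)
y = rng.integers(0, 5, size=200)     # 5 hypothetical composer classes

# Linear probe: a logistic-regression classifier trained on frozen
# embeddings, evaluated with 5-fold cross-validation.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(f"mean 5-fold accuracy: {scores.mean():.3f}")
```

Because the encoder stays frozen, probe accuracy isolates how much composer/style information the pre-trained representations already carry.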
```shell
# Using uv (recommended)
uv sync
uv pip install -e ".[dev]"

# Using pip
pip install -e .
pip install -e ".[dev]"
```

Download the BMdataset from Zenodo and extract it into `data/raw/`.
```shell
preprocess \
    preprocess.input_dir=data/raw \
    preprocess.output_dir=data/processed \
    preprocess.sharding.enabled=true \
    preprocess.sharding.tokenizer_path=artifacts/tokenizer \
    preprocess.sharding.output_dir=artifacts/pretokenized
```

```shell
train \
    dataset.processed_dir=data/processed \
    dataset.tokenizer_path=artifacts/tokenizer \
    runtime.output_dir=outputs/pretraining
```

```shell
python scripts/generate_layer_plot.py
python scripts/generate_tsne.py
python scripts/generate_confusion_matrix.py
```

| Command | Description |
|---|---|
| `preprocess` | Preprocessing, tokenizer building, and sharding |
| `train` | MLM pre-training and fine-tuning |
| `embed` | Extract frozen-encoder embeddings for downstream tasks |
All commands use Hydra for configuration. Run any command with --help for usage details.
An ONNX version of the model is available on HuggingFace for inference without PyTorch, in any language with an ONNX runtime (Python, C++, C#, Java, JavaScript, Rust, ...).
Install the optional ONNX dependencies:
```shell
pip install -e ".[onnx]"
```

```python
from lilybert.models import LilyBERTOnnxEncoder

encoder = LilyBERTOnnxEncoder("csc-unipd/lilybert")
embeddings = encoder.encode(input_ids, attention_mask)  # (batch, 768) numpy array
```

Download `model.onnx` and `tokenizer.json` from csc-unipd/lilybert on HuggingFace. The ONNX model accepts int64 inputs (`input_ids`, `attention_mask`) and returns a float32 `last_hidden_state` of shape `(batch, seq_len, 768)`. Use index `[:, 0, :]` for CLS embeddings.
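The CLS pooling step is just an array slice. Here is a NumPy-only sketch with a random placeholder standing in for the model's actual output:

```python
import numpy as np

# Placeholder for the ONNX model's float32 last_hidden_state of shape
# (batch, seq_len, hidden) = (batch, seq_len, 768).
batch, seq_len, hidden = 4, 128, 768
last_hidden_state = np.random.rand(batch, seq_len, hidden).astype(np.float32)

# CLS pooling: take the embedding of each sequence's first token.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (4, 768)
```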
```python
from lilybert.data import LilyPondParser, LilyPondPreprocessor, LilyPondTokenizer
from lilybert.models import LilyBERTEncoder
from lilybert.training import TrainingConfig, MLMPretrainer
```

```
lilybert/
├── cli/                      # CLI entry points (preprocess, train, embed)
├── data/                     # Parsing, tokenization, sharding, datasets
│   ├── lexer.py              # MusicalLexer — LilyPond to linear token conversion
│   ├── parser.py             # Syntax validation, pitch normalization
│   ├── tokenizer.py          # Parser-aware BPE tokenizer
│   └── ...
├── models/
│   └── bert_classifier.py    # LilyBERTEncoder (CodeBERT wrapper)
└── training/
    ├── trainer.py            # MLMPretrainer (HuggingFace Trainer-based)
    ├── config.py             # TrainingConfig dataclass
    └── distributed.py        # DDP/FSDP utilities
conf/                         # Hydra configuration
scripts/                      # SLURM scripts and figure generation
notebooks/                    # Linear probing analysis
docs/                         # Design documentation
```
Hydra configuration follows a single-base pattern:
- Base config: `conf/config.yaml`
- Shared groups: `conf/dataset/`, `conf/model/`, `conf/runtime/`, `conf/environment/`
```shell
pytest tests/
pytest tests/ -v --cov=lilybert
pytest -m "not slow"   # skip slow tests
pytest -m "not model"  # skip tests requiring model downloads
```

```bibtex
@misc{spanio2026bmdatasetmusicologicallycuratedlilypond,
  title={BMdataset: A Musicologically Curated LilyPond Dataset},
  author={Matteo Spanio and Ilay Guler and Antonio Rodà},
  year={2026},
  eprint={2604.10628},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.10628},
}
```

Apache-2.0. See `LICENSE`.