lilyBERT


Code companion for "BMdataset: A Musicologically Curated LilyPond Dataset".

LilyPond is a text-based music engraving language with formal grammar, block structure, and backslash commands — making it structurally similar to a programming language. lilyBERT leverages this by starting from CodeBERT and adapting it to LilyPond through vocabulary extension with 115 domain-specific tokens and masked language model (MLM) pre-training.
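Vocabulary extension amounts to appending the domain tokens after CodeBERT's existing ids and growing the embedding matrix to match (in HuggingFace Transformers this is `tokenizer.add_tokens` followed by `model.resize_token_embeddings`). A stdlib-only sketch of the idea, with illustrative token names and sizes rather than the real 115-token vocabulary:

```python
import random

# Stand-ins: a tiny "CodeBERT" vocab and a few LilyPond commands
# (illustrative; lilyBERT adds 115 domain-specific tokens)
base_vocab = {"<s>": 0, "</s>": 1, "def": 2, "return": 3}
domain_tokens = ["\\relative", "\\clef", "\\time", "\\key"]

# New tokens get ids appended after the existing vocabulary
vocab = dict(base_vocab)
for tok in domain_tokens:
    vocab.setdefault(tok, len(vocab))

# The embedding matrix grows by one freshly initialised row per new token,
# while the pretrained rows are kept untouched
dim = 8
embeddings = [[0.0] * dim for _ in base_vocab]  # pretrained rows (placeholder)
while len(embeddings) < len(vocab):
    embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])

print(vocab["\\relative"], len(embeddings))  # new id 4, 8 rows total
```

MLM pre-training then continues on LilyPond text, so the new rows pick up musically meaningful representations while the pretrained ones adapt.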

| Resource | Link |
|----------|------|
| Paper | [arXiv:2604.10628](https://arxiv.org/abs/2604.10628) |
| Dataset | [Zenodo, doi:10.5281/zenodo.18723290](https://doi.org/10.5281/zenodo.18723290) |
| Model | [HuggingFace, csc-unipd/lilybert](https://huggingface.co/csc-unipd/lilybert) |
| Code | [GitHub, CSCPadova/lilybert](https://github.com/CSCPadova/lilybert) |

Key results

Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold CV):

| Model | Composer Acc. | Style Acc. |
|-------|---------------|------------|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| CB + PDMX -> BM | 84.3 | 82.9 |

90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy (84.3%).
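The probing protocol above can be sketched with scikit-learn: keep the encoder frozen, take one embedding per score, and cross-validate a linear classifier on top. Synthetic data here; the paper uses layer-6 lilyBERT embeddings and Mutopia composer/style labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768)).astype(np.float32)  # frozen encoder embeddings
y = rng.integers(0, 5, size=200)                    # composer labels (5 classes here)

# Linear probe: the encoder stays frozen, only this classifier is trained
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")  # 5-fold CV
print(f"mean accuracy: {scores.mean():.3f}")
```

Because the probe is linear, its accuracy reflects how separable the classes already are in the frozen embedding space.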

Installation

```shell
# Using uv (recommended)
uv sync
uv pip install -e ".[dev]"

# Using pip
pip install -e .
pip install -e ".[dev]"
```

Reproducing paper results

Download the dataset

Download the BMdataset from Zenodo and extract it into data/raw/.

Preprocess

```shell
preprocess \
  preprocess.input_dir=data/raw \
  preprocess.output_dir=data/processed \
  preprocess.sharding.enabled=true \
  preprocess.sharding.tokenizer_path=artifacts/tokenizer \
  preprocess.sharding.output_dir=artifacts/pretokenized
```

Train (MLM pre-training)

```shell
train \
  dataset.processed_dir=data/processed \
  dataset.tokenizer_path=artifacts/tokenizer \
  runtime.output_dir=outputs/pretraining
```

Generate figures

```shell
python scripts/generate_layer_plot.py
python scripts/generate_tsne.py
python scripts/generate_confusion_matrix.py
```

CLI reference

| Command | Description |
|---------|-------------|
| `preprocess` | Preprocessing, tokenizer building, and sharding |
| `train` | MLM pre-training and fine-tuning |
| `embed` | Extract frozen-encoder embeddings for downstream tasks |

All commands use Hydra for configuration. Run any command with --help for usage details.

ONNX inference

An ONNX version of the model is available on HuggingFace for inference without PyTorch, in any language with an ONNX runtime (Python, C++, C#, Java, JavaScript, Rust, ...).

Install the optional ONNX dependencies:

```shell
pip install -e ".[onnx]"
```

Python

```python
from lilybert.models import LilyBERTOnnxEncoder

encoder = LilyBERTOnnxEncoder("csc-unipd/lilybert")

# input_ids and attention_mask are int64 arrays produced by the tokenizer
embeddings = encoder.encode(input_ids, attention_mask)  # (batch, 768) numpy array
```

Other languages

Download `model.onnx` and `tokenizer.json` from csc-unipd/lilybert on HuggingFace. The ONNX model accepts int64 inputs (`input_ids`, `attention_mask`) and returns a float32 `last_hidden_state` of shape (batch, seq_len, 768); index `[:, 0, :]` for CLS embeddings.
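Whatever the runtime, the input dtypes and the CLS slice are the same. A numpy sketch with a placeholder array standing in for the model output (with onnxruntime in Python, `last_hidden_state` would come from `InferenceSession.run`):

```python
import numpy as np

# The model expects int64 ids and mask (tokenizer output; values illustrative)
input_ids = np.array([[0, 51, 52, 2]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# Placeholder for the float32 last_hidden_state an ONNX runtime returns
batch, seq_len, hidden = 2, 16, 768
last_hidden_state = np.zeros((batch, seq_len, hidden), dtype=np.float32)

cls_embeddings = last_hidden_state[:, 0, :]  # first token of each sequence
print(cls_embeddings.shape)  # (2, 768)
```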

Python API

```python
from lilybert.data import LilyPondParser, LilyPondPreprocessor, LilyPondTokenizer
from lilybert.models import LilyBERTEncoder
from lilybert.training import TrainingConfig, MLMPretrainer
```

Project structure

```
lilybert/
├── cli/                # CLI entry points (preprocess, train, embed)
├── data/               # Parsing, tokenization, sharding, datasets
│   ├── lexer.py        # MusicalLexer — LilyPond to linear token conversion
│   ├── parser.py       # Syntax validation, pitch normalization
│   ├── tokenizer.py    # Parser-aware BPE tokenizer
│   └── ...
├── models/
│   └── bert_classifier.py  # LilyBERTEncoder (CodeBERT wrapper)
└── training/
    ├── trainer.py      # MLMPretrainer (HuggingFace Trainer-based)
    ├── config.py       # TrainingConfig dataclass
    └── distributed.py  # DDP/FSDP utilities

conf/                   # Hydra configuration
scripts/                # SLURM scripts and figure generation
notebooks/              # Linear probing analysis
docs/                   # Design documentation
```

Configuration

Hydra configuration follows a single-base pattern:

- Base config: `conf/config.yaml`
- Shared groups: `conf/dataset/`, `conf/model/`, `conf/runtime/`, `conf/environment/`
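Under this pattern, the base config composes one option from each shared group via a Hydra defaults list. A hypothetical sketch (the option names here are illustrative, not the repo's actual files):

```yaml
# conf/config.yaml (illustrative option names)
defaults:
  - dataset: bmdataset
  - model: lilybert
  - runtime: local
  - environment: default
  - _self_
```

Any composed key can then be overridden from the command line, as in the `train` and `preprocess` invocations above.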

Testing

```shell
pytest tests/
pytest tests/ -v --cov=lilybert
pytest -m "not slow"    # skip slow tests
pytest -m "not model"   # skip tests requiring model downloads
```

Citation

```bibtex
@misc{spanio2026bmdatasetmusicologicallycuratedlilypond,
  title={BMdataset: A Musicologically Curated LilyPond Dataset},
  author={Matteo Spanio and Ilay Guler and Antonio Rodà},
  year={2026},
  eprint={2604.10628},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.10628},
}
```

License

Apache-2.0. See LICENSE.
