TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
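For intuition, here is a minimal, illustrative sketch of such an IDS (insertion/deletion/substitution) channel in Python; the error rates and function name are assumptions for illustration, not the project's actual data pipeline:

```python
import random

def ids_channel(seq, p_ins=0.01, p_del=0.01, p_sub=0.01, alphabet="ACGT"):
    """Corrupt a DNA sequence with independent insertions, deletions, and substitutions."""
    out = []
    for base in seq:
        if random.random() < p_ins:   # insert a random base before this position
            out.append(random.choice(alphabet))
        if random.random() < p_del:   # delete this base
            continue
        if random.random() < p_sub:   # substitute a different base
            base = random.choice([b for b in alphabet if b != base])
        out.append(base)
    return "".join(out)

# A cluster: several independent noisy traces of the same ground truth.
traces = [ids_channel("ACGTACGTAC") for _ in range(5)]
```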
Tested on Ubuntu 22.04.4 LTS with PyTorch 2.1.0 and CUDA 12.1.
- Create the conda environment:

  ```bash
  conda env create -f treconlm.yml
  ```

- Install build tools (needed to compile extensions):

  ```bash
  sudo apt update && sudo apt install -y build-essential
  ```

- Add the project to your Python path:

  ```bash
  export PYTHONPATH="${PYTHONPATH}:/path/to/treconlm"
  ```
Open the project in VS Code with the Dev Containers extension. It builds and starts a Docker container with all dependencies pre-installed from .devcontainer/devcontainer.json.
Note: You may need to adjust `mounts` and `runArgs` in `.devcontainer/devcontainer.json` for your machine (memory, CPUs, extra bind mounts).
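For example, such overrides might look like the following (the resource limits and bind-mount path are placeholders, not values shipped with the repo):

```json
{
  // illustrative limits and bind mount; adjust for your machine
  "runArgs": ["--memory=64g", "--cpus=16", "--gpus=all"],
  "mounts": [
    "source=/path/to/datasets,target=/workspace/data,type=bind"
  ]
}
```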
To use a different PyTorch / CUDA version:
- Local setup: update the `torch` / `torchvision` / `cudatoolkit` / `--extra-index-url` entries in `treconlm.yml`.
- Dev Container: update the `FROM` image in `.devcontainer/Dockerfile`.
See PyTorch previous versions for compatible combinations.
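For instance, the pip section of such an environment file might pin a build like this (the pins and index URL below are illustrative; pick a combination that matches your driver):

```yaml
# illustrative pins; see the PyTorch previous-versions page for valid combinations
dependencies:
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/cu121
      - torch==2.1.0
      - torchvision==0.16.0
```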
Pretrained and fine-tuned models, as well as synthetic test datasets, are available on Hugging Face (see `tracereconstruction2026/TReconLM`).
Start with the tutorial notebooks in `tutorial/`:

- `quick_start.ipynb`: download models from Hugging Face and run inference on synthetic datasets
- `custom_data.ipynb`: run inference on your own data or use the Microsoft/Noisy/Chandak DNA datasets
```bash
python src/inference.py exps=<experiment>
```

Quick test (runs inference on tutorial/example_data with a pretrained model):

```bash
# Download a model from Hugging Face
mkdir -p models
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('tracereconstruction2026/TReconLM', 'model_seq_len_110.pt', local_dir='models')"

# Run inference
python src/inference.py exps=test/inference_example
```

For custom data, provide two files (a minimal parsing sketch follows below):

- `ground_truth.txt`: one DNA sequence per line (ACGT only)
- `reads.txt`: clusters of 2-10 noisy reads separated by `===============================`
See tutorial/custom_data.ipynb for details.
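Here is a minimal sketch of reading such a `reads.txt` into clusters; it is an illustrative reader based on the format described above, not the repo's own loader:

```python
def load_clusters(path="reads.txt"):
    """Group the lines of reads.txt into clusters, split on '=' separator lines."""
    clusters, current = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and set(line) == {"="}:   # separator line between clusters
                if current:
                    clusters.append(current)
                current = []
            elif line:
                current.append(line)
    if current:
        clusters.append(current)
    return clusters

clusters = load_clusters()
assert all(2 <= len(c) <= 10 for c in clusters)  # clusters of 2-10 noisy reads
```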
To run with FlashAttention for faster training (see PyTorch issue):
```bash
pip install nvidia-cuda-nvcc-cu11
export TRITON_PTXAS_PATH=/opt/conda/envs/treconlm/lib/python3.11/site-packages/nvidia/cuda_nvcc/bin/ptxas
```

To pretrain:

```bash
python src/pretrain.py exps=<experiment>
```

Quick test (runs 100 iterations with a small model):

```bash
python src/pretrain.py exps=test/pretrain_scratch
```

To reproduce paper results or train with different settings, choose an experiment from `src/hydra/train_config/exps/` (e.g., `ids_110nt/ids_110nt`, `ids_60nt/ids_60nt`).
Use `torchrun --nproc_per_node=<gpus>` for multi-GPU training. Pretraining data is generated on the fly.
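For example, to pretrain on 4 GPUs (using one of the experiment names listed above):

```bash
# 4 GPUs; experiment name from src/hydra/train_config/exps/
torchrun --nproc_per_node=4 src/pretrain.py exps=ids_110nt/ids_110nt
```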
To fine-tune:

```bash
python src/finetune.py exps=<experiment>
```

Quick test (runs 100 iterations on tutorial/example_data):

```bash
python src/finetune.py exps=test/finetune_scratch
```

To fine-tune on real datasets, use experiments like `microsoft/mic` or `noisyDNA/noisy` from `src/hydra/train_config/exps/`.
Example cluster scripts can be found in src/slurm_pkg/.
Note: If you get `ImportError: ... GLIBCXX_3.4.29 not found` (scipy/torchmetrics), your system's `libstdc++` is too old. Install conda's version into the env:

```bash
conda install -c conda-forge "libstdcxx-ng>=12"
```
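To confirm the mismatch first, a common check is to list the GLIBCXX versions your system library provides (the library path below is typical for Ubuntu and may differ on other distros):

```bash
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
```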
- Pretraining: training a ~38M parameter model on ~300M examples (sequence length L = 110, cluster sizes uniformly sampled between 2 and 10, totaling ~440B tokens) on 4 NVIDIA H100 GPUs takes approximately 71.1 hours (an aggregate throughput of roughly 1.7M tokens/s).
- Fine-tuning: fine-tuning a ~38M parameter model on ~5.5M examples (sequence length L = 60, cluster sizes between 2 and 10, totaling ~4.39B tokens) takes approximately 20.6 hours.
Configuration files for our synthetic data generation are in `src/hydra/data_config`.
To generate new test datasets, run from `src`:

```bash
python data_pkg/data_generation.py
```
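Since the project uses Hydra, generation settings can typically be overridden on the command line in `key=value` form; the keys below are hypothetical placeholders, so check `src/hydra/data_config` for the actual names:

```bash
# seq_len and num_clusters are illustrative key names, not confirmed config fields
python data_pkg/data_generation.py seq_len=60 num_clusters=10000
```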
To run inference with non-deep learning baselines:

```bash
python src/eval_pkg/eval_all_baselines.py --alg <algorithm>
```

Available algorithms:

```python
ALGS = {
    'bmala': BMALA,                      # BMA look-ahead
    'itr': Iterative,                    # iterative reconstruction
    'muscle': MuscleAlgorithm,           # MUSCLE multiple-sequence alignment
    'trellisbma': TrellisBMAAlgorithm,   # trellis BMA
    'vs': VSAlgorithm,                   # VS algorithm
}
```

An example cluster script is available in `src/slurm_pkg/baselines`.
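For example, to run the BMA look-ahead baseline:

```bash
python src/eval_pkg/eval_all_baselines.py --alg bmala
```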
To pretrain, fine-tune, or run inference with our deep learning baselines, see:
- `DeepLearningBaselines/DNAFormer/slurm_pkg`
- `DeepLearningBaselines/RobuSeqNet/slurm_pkg`
These contain example SLURM execution scripts.
Requires Python 3.11. Create a fresh environment and run the test suite with nox:
```bash
conda create -n treconlm python=3.11 -y
conda activate treconlm
pip install nox
nox
```
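nox discovers test sessions from a `noxfile.py` at the project root; a minimal sketch of what such a session can look like is below (the repo's actual noxfile, session names, and dependency files may differ):

```python
import nox

@nox.session(python="3.11")
def tests(session):
    # Install test dependencies and run the suite.
    # "requirements.txt" and pytest are illustrative assumptions.
    session.install("-r", "requirements.txt")
    session.install("pytest")
    session.run("pytest")
```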
The original implementations of the baselines were taken from:

- VS, BMALA, ITR: github.com/omersabary/Reconstruction
- MUSCLE: github.com/rcedgar/muscle
- TrellisBMA: github.com/orenht/DNA-trellis-reconstruction
- RobuSeqNet: github.com/qinyunnn/RobuSeqNet
- DNAformer: github.com/itaiorr/Deep-DNA-based-storage