ML models that use SBOL for DNA annotation #288
Description
Background
Machine Learning (ML) has advanced rapidly in the last few years. Central to any ML model is the training process, in which the model learns from data, so data quality and quantity are fundamental to model performance. Datasets like ImageNet promoted the use of ML for image classification in computer vision, leading to major advances such as AlexNet and ResNet. Synthetic Biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression prediction is a hallmark problem in SynBio, but the field lacks easy-to-use datasets that would let researchers and developers focus on creating new models instead of gathering and preprocessing data. In a previous project we started developing SeqTrainer to create ML datasets from SBOL [1] designs in SynBioHub. To advance this effort we need to include tokenizers and ML models tailored to DNA sequence data.
In this project we will explore the use of foundation models such as Evo 2, DNABERT, HyenaDNA and TITANS to predict constitutive expression and to label DNA with sequence features from SBOL, such as promoters and coding sequences.
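As a rough illustration of what a DNA tokenizer does, the sketch below splits a sequence into overlapping k-mers and maps each one to an integer id. This is only one of several possible tokenization schemes, and the function names `build_kmer_vocab` and `tokenize` are hypothetical; they are not part of SeqTrainer's API.

```python
from itertools import product


def build_kmer_vocab(k=3, alphabet="ACGT"):
    """Map every k-mer over the alphabet to an integer id.

    Id 0 is reserved for k-mers that contain symbols outside the alphabet.
    """
    vocab = {"[UNK]": 0}
    for i, kmer in enumerate(product(alphabet, repeat=k), start=1):
        vocab["".join(kmer)] = i
    return vocab


def tokenize(sequence, vocab, k=3, stride=1):
    """Convert a DNA string into a list of k-mer ids using a sliding window."""
    seq = sequence.upper()
    return [
        vocab.get(seq[i:i + k], vocab["[UNK]"])
        for i in range(0, len(seq) - k + 1, stride)
    ]


vocab = build_kmer_vocab(k=3)
token_ids = tokenize("acgtn", vocab, k=3)
```

With `stride=1` the windows overlap; setting `stride=k` would give non-overlapping tokens and shorter inputs, at the cost of making the tokenization depend on the reading frame.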
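One way to frame the labeling task is as per-nucleotide classification: each base receives the class of the feature that covers it. The sketch below builds such label vectors from a list of annotated ranges. The `Feature` class, the `LABELS` mapping and the role names are hypothetical stand-ins for whatever is extracted from an SBOL design, and the coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical container for one annotated range (1-based, inclusive)."""
    start: int
    end: int
    role: str


# Hypothetical label scheme; roles not listed here are left as "other".
LABELS = {"other": 0, "promoter": 1, "CDS": 2}


def label_sequence(sequence, features):
    """Return one integer label per nucleotide of `sequence`."""
    labels = [LABELS["other"]] * len(sequence)
    for feature in features:
        if feature.role not in LABELS:
            continue  # skip roles outside the label scheme
        first = max(feature.start, 1) - 1          # convert to 0-based index
        last = min(feature.end, len(sequence))     # clip to sequence length
        for pos in range(first, last):
            labels[pos] = LABELS[feature.role]
    return labels


example_labels = label_sequence(
    "ACGTACGTAC",
    [Feature(1, 3, "promoter"), Feature(5, 8, "CDS"), Feature(9, 10, "terminator")],
)
```

If features overlap, later entries in the list overwrite earlier ones in this sketch; a real dataset builder would need an explicit policy for that case, for example multi-label targets.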
[1] Buecherl, Lukas, et al. "Synthetic biology open language (SBOL) version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).
Goal
Develop models for DNA annotation using SBOL.
Specific Goals:
- Develop workflows to train models on HPC systems.
- Explore the baseline performance of existing models.
- Modify a model, then domain-adapt and fine-tune it for bacteria.
- Document the developments and create example notebooks.
Difficulty Level: Medium
This project involves applying ML models to predict expression and to annotate DNA using SBOL, and integrating them into a Python package for dataset creation and model training.
Size and Length of Project
- medium: 175 hours
- 16 weeks
Skills
Essential skills: Python, ML, GitHub, Git
Nice to have skills: SBOL, LLM
Public Repository
https://github.com/SynBioDex/SeqTrainer
Potential Mentors
Gonzalo Vidal (Gonzalo.vidalpena@colorado.edu)
Chris Myers