ML models that use SBOL for DNA annotation #288

@Gonza10V

Background

Machine learning (ML) has advanced rapidly in the last few years. Central to ML is the training process, in which models learn from data, so data quality and quantity are fundamental to model performance. Datasets such as ImageNet promoted the use of ML for image classification in computer vision, leading to major advances such as AlexNet and ResNet. Synthetic biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression prediction is a hallmark problem for SynBio, but the field lacks easy-to-use datasets that would let researchers and developers focus on creating new models instead of gathering and preprocessing data. In a previous project we started developing SeqTrainer to create ML datasets from SBOL [1] designs in SynBioHub. To advance this effort we need to include tokenizers and ML models tailored to DNA sequence data.
In this project we will explore the use of foundation models such as Evo 2, DNABERT, HyenaDNA and TITANS to predict constitutive expression and to label DNA with sequence features from SBOL, such as promoters and coding sequences.
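
As a rough illustration of what "labeling DNA with sequence features" could mean as a training target, the sketch below turns feature annotations into one label per base. It is not SeqTrainer code: the `Feature` tuple, the `label_sequence` function, the plain-string roles and the `"O"` background label are all hypothetical, and the 1-based inclusive coordinates are an assumption about how feature ranges would be exported.

```python
from typing import List, Tuple

# Hypothetical feature record: (start, end, role).
# Assumption: coordinates are 1-based and inclusive.
Feature = Tuple[int, int, str]


def label_sequence(
    sequence: str, features: List[Feature], background: str = "O"
) -> List[str]:
    """Return one label per base; unannotated bases get `background`."""
    labels = [background] * len(sequence)
    for start, end, role in features:
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"feature {role!r} ({start}-{end}) is out of range")
        # Convert 1-based inclusive coordinates to 0-based indices.
        for i in range(start - 1, end):
            labels[i] = role  # later features overwrite earlier ones
    return labels


# Toy 21-base sequence with a promoter-like region and a short CDS.
seq = "TTGACAGCTAGCATGAAATAA"
feats = [(1, 6, "promoter"), (13, 21, "CDS")]
labels = label_sequence(seq, feats)

for base, label in zip(seq, labels):
    print(base, label)
```

Per-base labels of this kind fit a token-classification setup when the tokenizer works at single-nucleotide resolution; a k-mer or byte-pair tokenizer would need an extra step that maps base labels onto tokens.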

[1] Buecherl, Lukas, et al. "Synthetic biology open language (SBOL) version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).

Goal

Develop models for DNA annotation using SBOL.

Specific Goals:
Develop workflows to train models using HPC.
Explore baseline performance of existing models.
Modify a model, then domain-adapt and fine-tune it for bacteria.
Document the developments and create example notebooks.
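
One possible way to compare baselines on the expression-prediction task is a rank correlation between predicted and measured expression. The sketch below computes Spearman's correlation in plain Python. The choice of metric is an assumption rather than something the project specifies, and the function names and toy values are illustrative only.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy values, not real measurements.
measured_expression = [0.10, 0.40, 0.35, 0.80]
predicted_expression = [0.20, 0.50, 0.30, 0.90]
print(spearman(predicted_expression, measured_expression))
```

A rank-based score only asks whether a model orders sequences correctly by expression, which makes it less sensitive to differences in measurement scale between datasets.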

Difficulty Level: Medium

This project involves applying ML models to predict expression and annotate DNA using SBOL, and integrating them into a Python package for dataset creation and model training.

Size and Length of Project

  • medium: 175 hours
  • 16 weeks

Skills

Essential skills: Python, ML, GitHub, Git
Nice to have skills: SBOL, LLM

Public Repository

https://github.com/SynBioDex/SeqTrainer

Potential Mentors

Gonzalo Vidal (Gonzalo.vidalpena@colorado.edu)
Chris Myers
