ML models that use SBOL for DNA annotation

### Background
Machine learning (ML) has advanced rapidly in the last few years. Central to ML is the training process, in which models learn from data, so data quality and quantity are fundamental to model performance. Datasets like ImageNet promoted the use of ML for image classification in computer vision, leading to major advances such as AlexNet and ResNet. Synthetic biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression prediction is a hallmark problem for SynBio, but the field lacks easy-to-use datasets that would let researchers and developers focus on creating new models instead of gathering and preprocessing data. In a previous project we started developing [SeqTrainer](https://github.com/SynBioDex/SeqTrainer) to create ML datasets from SBOL [1] designs in SynBioHub. To advance this effort, we need to include tokenizers and ML models tailored to DNA sequence data.
In this project we will explore the use of foundation models such as Evo 2, DNABERT, HyenaDNA, and TITANS to predict constitutive expression and to label DNA with SBOL sequence features such as promoters and coding sequences.
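
To make the annotation task concrete, here is a minimal sketch of how feature annotations could be turned into per-base training labels, alongside a simple k-mer tokenizer. It is an illustration only: `Feature`, `label_bases`, and `kmer_tokens` are hypothetical names, not part of SeqTrainer or pySBOL3, and the sketch assumes 0-based, half-open coordinates (SBOL `Range` locations are, to our understanding, 1-based and inclusive, so real coordinates would need converting).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for an SBOL sequence feature (not the pySBOL3 API)."""
    role: str   # e.g. "promoter" or "CDS"
    start: int  # 0-based, inclusive (assumption of this sketch)
    end: int    # 0-based, exclusive (assumption of this sketch)


def label_bases(sequence: str, features: list[Feature], background: str = "O") -> list[str]:
    """Return one label per base; bases not covered by any feature get `background`.

    If features overlap, the later feature in the list overwrites the earlier one.
    """
    labels = [background] * len(sequence)
    for feat in features:
        if not 0 <= feat.start < feat.end <= len(sequence):
            raise ValueError(f"feature {feat} is outside the sequence")
        for i in range(feat.start, feat.end):
            labels[i] = feat.role
    return labels


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into non-overlapping k-mers; a shorter final token is kept."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# Toy design: a 6 bp "promoter" followed by a 9 bp "CDS" (made-up sequence).
sequence = "TTGACAATGAAATAA"
features = [Feature("promoter", 0, 6), Feature("CDS", 6, 15)]

labels = label_bases(sequence, features)
tokens = kmer_tokens(sequence, k=3)
```

Per-base labels keep the target independent of the tokenizer; for a model that consumes k-mers or learned subword tokens, the labels would still need to be aggregated per token, and how to do that is left open here.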


[1] Buecherl, Lukas, et al. "Synthetic biology open language (SBOL) version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).

### Goal
Develop models for DNA annotation using SBOL.

Specific Goals:
- Develop workflows to train models using HPC.
- Explore baseline performance of existing models.
- Modify a model, then domain-adapt and fine-tune it for bacteria.
- Document the developments and create example notebooks.
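
As one possible way to compare baselines on the annotation task, the sketch below computes a per-class F1 score from per-base labels. The choice of metric and the function name `per_class_f1` are assumptions made for illustration; the project does not prescribe an evaluation protocol.

```python
from collections import Counter


def per_class_f1(true_labels: list[str], pred_labels: list[str]) -> dict[str, float]:
    """Compute F1 per label from two equal-length lists of per-base labels."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(true_labels, pred_labels):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[true] += 1  # true class gains a false negative

    scores = {}
    for label in set(true_labels) | set(pred_labels):
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy example with made-up labels: one promoter base is mislabeled as CDS.
true = ["promoter"] * 4 + ["CDS"] * 4
pred = ["promoter"] * 3 + ["CDS"] * 5
scores = per_class_f1(true, pred)
```

Reporting the score per class rather than overall accuracy avoids a long feature, such as a coding sequence, hiding poor performance on short ones such as promoters.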

### Difficulty Level: Medium
This project involves applying ML models to predict expression and annotate DNA using SBOL, and integrating them into a Python package for dataset creation and model training.

### Size and Length of Project
- **medium: 175 hours**
- **16 weeks**

### Skills
Essential skills: Python, ML, GitHub, Git
Nice to have skills: SBOL, LLM

### Public Repository
https://github.com/SynBioDex/SeqTrainer

### Potential Mentors
Gonzalo Vidal (Gonzalo.vidalpena@colorado.edu)
Chris Myers

