Although high-throughput sequencing provides access to complete genomes, the structural annotation of genes in these genomes remains a key step, especially in plants, whose genomes are complex (polyploidy, numerous transposable elements). The recent application of deep learning in annotation tools is expected to speed up the production of both structural and functional annotations.
The GBOT database contains 6 plant genomes, each with its official annotation and, for some, the annotation generated by Helixer (Stiehler et al. 2021), an annotation tool that combines deep neural networks and HMM-type models to predict gene models from the genomic sequence alone. The internship consists of:
- Understanding Helixer and applying it to genomes that are not yet annotated
- Performing a global comparison of the annotations for each genome (a comparison has already been carried out)
- Targeting new genes defined by Helixer and highlighting their characteristics on the structural side (gene size, number of exons) and the functional side (encoded protein and functional annotation)
- Targeting genes that correspond to known genes annotated without a 5' UTR, taking stock of the properties of these genes, and checking whether a TATA box is present near the newly annotated UTR
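
As an illustration of the structural side of these tasks, the sketch below computes gene size and exon count from GFF3 text, lists predicted genes that overlap no reference gene, and scans upstream of a transcription start site for a TATA box. It is a minimal sketch under stated assumptions, not the method used for GBOT: the function names are hypothetical, the GFF3 input is assumed to use standard `ID`/`Parent` attributes, and the `TATA[AT]A[AT]` consensus and the 50 bp upstream window are illustrative choices that would need tuning per species.

```python
import re
from collections import defaultdict

def gene_stats(gff3_text):
    """Map gene ID -> (seqid, start, end, strand, length, n_exons) from GFF3 text."""
    genes, tx_to_gene, exons = {}, {}, []
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, attrs = cols
        attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        start, end = int(start), int(end)
        if ftype == "gene":
            genes[attr["ID"]] = (seqid, start, end, strand)
        elif ftype in ("mRNA", "transcript"):
            tx_to_gene[attr["ID"]] = attr["Parent"]
        elif ftype == "exon":
            exons.append((attr["Parent"], start, end))
    # Assumption: exons shared by several isoforms of a gene are counted once.
    exon_sets = defaultdict(set)
    for parent, start, end in exons:
        for tx in parent.split(","):
            exon_sets[tx_to_gene.get(tx, tx)].add((start, end))
    return {
        gid: (seqid, start, end, strand, end - start + 1, len(exon_sets[gid]))
        for gid, (seqid, start, end, strand) in genes.items()
    }

def novel_genes(predicted, reference):
    """IDs of predicted genes overlapping no reference gene (same sequence and strand)."""
    novel = []
    for gid, (seqid, start, end, strand, *_rest) in predicted.items():
        overlaps = any(
            r[0] == seqid and r[3] == strand and r[1] <= end and start <= r[2]
            for r in reference.values()
        )  # quadratic scan: fine for a sketch, too slow for whole genomes
        if not overlaps:
            novel.append(gid)
    return novel

# Illustrative consensus only; real promoters vary.
TATA = re.compile(r"TATA[AT]A[AT]")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def has_tata_box(sequence, tss, strand, upstream=50):
    """True if the consensus occurs within `upstream` bp 5' of a 1-based TSS."""
    sequence = sequence.upper()
    if strand == "+":
        window = sequence[max(0, tss - 1 - upstream): tss - 1]
    else:
        # On the minus strand, upstream lies at higher coordinates.
        window = sequence[tss: tss + upstream][::-1].translate(COMPLEMENT)
    return TATA.search(window) is not None
```

For genome-scale data, an interval index or a dedicated tool would replace the pairwise overlap scan; the functional side (protein and functional annotation) is outside the scope of this sketch.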