This repository contains five modular, reproducible Bash pipelines designed for a comprehensive study of genome assembly, annotation, mutation identification, and mutation simulation in Arabidopsis thaliana. The scripts automate key steps from raw assembly to simulated evolution of tandem repeat arrays.
project-root/
├── 1.assembly.sh # Genome assembly and evaluation
├── 2.annotation.sh # Repeat annotation (CEN178, rDNA, SSRs)
├── 3.assembly_difference.sh # Assembly comparison, validation, and correction
├── 4.cen_mutation.sh # Centromeric mutation analysis and alignment refinement
├── 5.simulation.sh # Simulation of gene conversion and tandem repeat evolution
├── bin/ # Contains Python and R scripts used in the pipeline
└── data/ # Input reference genomes, annotations, and simulation templates

1.assembly.sh (genome assembly and evaluation):
- De novo assembly of HiFi reads with HiFiasm, IPA, Peregrine, Canu, and Flye
- De novo assembly of ONT reads with HiFiasm, polished with Dorado
- Reference-based scaffolding
- Assembly quality assessment using NG50, Merqury, and BUSCO
- Consensus genome generation for downstream mutation comparison

2.annotation.sh (repeat annotation):
- CEN178 annotation using TRASH and CentroAnno
- rDNA and telomeric repeat detection with RepeatMasker and a custom library
- Simple sequence repeat annotation using Arabidopsis-specific repeat libraries
- TE annotation using EDTA and ATHILAfinder

3.assembly_difference.sh (assembly comparison, validation, and correction):
- Structural variant calling using SyRI
- Misassembly validation based on alignment of Illumina and HiFi reads
- Error correction with Pilon, DeepVariant, pbsv, and Sniffles

4.cen_mutation.sh (centromeric mutation analysis and alignment refinement):
- Word-based alignment for optimal matching in repeat-dense regions
- Mutation left-alignment and false-positive filtering
- HOR score analysis and in-frame mutation pattern checking

5.simulation.sh (simulation of gene conversion and tandem repeat evolution):
- Gene conversion simulation: introduces random point mutations and detects non-allelic conversion events via k-mer overlap
- Tandem repeat mutation simulation: evolves five A. thaliana centromeric repeats over 150,000 generations, followed by analysis of homogenization, consensus generation, heatmap creation, and video animation
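
As an illustration of the NG50 metric named in the assembly quality assessment for 1.assembly.sh, here is a minimal sketch. It is not the code used in the pipeline. NG50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of an assumed genome size. The function names `fasta_lengths` and `ng50`, the file name `assembly.fasta`, and the `GENOME_SIZE` variable in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and inputs are illustrative, not part of the repository.

# Print the length of every sequence in a FASTA file, one length per line.
fasta_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }  # new record: flush the previous one
             { len += length($0) }                             # sequence line: accumulate its length
        END  { if (seen) print len }                           # flush the last record
    ' "$1"
}

# Read contig lengths on stdin and print NG50 for the genome size given as $1.
# Prints "NA" when the contigs together cover less than half of the genome size.
ng50() {
    local genome_size="$1"
    sort -rn | awk -v g="$genome_size" '
        { sum += $1; if (sum >= g / 2) { print $1; found = 1; exit } }
        END { if (!found) print "NA" }
    '
}

# Example usage (GENOME_SIZE is an assumed estimate that you supply):
#   fasta_lengths assembly.fasta | ng50 "$GENOME_SIZE"
```

Unlike N50, which uses the total assembly length, NG50 uses a fixed genome size estimate, so values from different assemblers can be compared on the same scale.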
Each script has its own software dependencies, which are listed at the top of the respective file. Common tools and packages include:
- Genome assemblers: HiFiasm, IPA, Peregrine, Canu, Flye, Dorado
- Annotation tools: TRASH, RepeatMasker, CentroAnno, ATHILAfinder
- Variant analysis: SyRI, Pilon, DeepVariant, Sniffles, pbsv
- Supporting scripts: Python, R, Perl (located in bin/)
- Custom input files (located in data/)
Make sure all dependencies are installed and their paths are properly configured. Then execute the scripts in order, or run individual scripts independently as needed.
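
The following is a minimal sketch of one way to check dependencies and run the five scripts in order. It assumes that each script runs without arguments and is invoked with `bash`; check the header of each script for its actual requirements. The helper names `check_tools` and `run_pipeline` are hypothetical and are not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of the repository.

# Report any command named in the arguments that is not found on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing dependency: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Run the five pipeline scripts in numeric order from the given directory
# (default: current directory), stopping at the first missing or failing script.
run_pipeline() {
    local dir="${1:-.}" script
    for script in 1.assembly.sh 2.annotation.sh 3.assembly_difference.sh \
                  4.cen_mutation.sh 5.simulation.sh; do
        if [ ! -f "$dir/$script" ]; then
            echo "script not found: $dir/$script" >&2
            return 1
        fi
        echo "Running $script"
        bash "$dir/$script" || return 1
    done
}

# Example usage (the executable names passed to check_tools are assumptions):
#   check_tools python3 Rscript perl && run_pipeline /path/to/project-root
# To run a single step independently:
#   bash 4.cen_mutation.sh
```

Stopping at the first failure keeps later steps from running on incomplete output from an earlier one.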
Please cite the following work when using this repository or any of its components in your research:
Dong, X. et al. "The mutational dynamics of the Arabidopsis centromeres" bioRxiv, https://doi.org/10.1101/2025.06.02.657473
For questions, please contact: xdong@mpipz.mpg.de