Gabaldonlab/Asgard
DATA PREPARATION AND PANGENOME ANALYSIS

We selected the representative genomes from GTDB r226 with select_proteomes.R. Requirements:

  • Tidyverse (2.0.0)
  • rentrez (1.2.4)
  • phylotools (0.2.2)

This script also contains the commands to download the genomes via NCBI datasets. We checked their translation tables with check_translation_codes.py. Requirements:

  • urllib.request
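The translation-table check can be sketched as follows. This is a minimal hypothetical version of what check_translation_codes.py does, not the actual script: it scans an annotation file (e.g. a GenBank flat file fetched with urllib.request, the script's only listed dependency) for transl_table qualifiers.

```python
import re
import urllib.request

def parse_transl_tables(text):
    """Collect all transl_table codes found in GenBank/GFF-style text."""
    return set(re.findall(r"transl_table=(\d+)", text))

def check_translation_codes(url):
    """Download an annotation file and report the translation tables it uses.

    Hypothetical helper: `url` would point at e.g. an NCBI GenBank flat file.
    """
    with urllib.request.urlopen(url) as fh:
        return parse_transl_tables(fh.read().decode())
```

A genome annotated with the standard bacterial/archaeal code would report {"11"}.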

We then performed genome annotation with Prokka using the following command:

prokka --kingdom $kingdom --outdir ${output} --metagenome ${input}

Once we obtained the proteomes, we ran OrthoFinder with the following commands:

orthofinder -op -S blast -f ${input_folder}/ | grep "blastp" | grep -v "Test" | sed 's/-query/-num_threads 4 -query/' > blast_cmds.txt
# Run the BLAST commands in parallel; once finished:
orthofinder -b ${WorkingDir} -og -a 112
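The "run in parallel" step above is left to the user. One way to do it is sketched below; this is an assumed helper, not part of the repository. Threads suffice because each job is an external blastp process, and the worker count is an assumption to tune to the machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_cmd(cmd):
    """Run one shell command (one line of blast_cmds.txt); return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

def run_blast_commands(cmd_file, workers=28):
    """Run every command listed in cmd_file concurrently."""
    with open(cmd_file) as fh:
        cmds = [line.strip() for line in fh if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_cmd, cmds))
```

Once all exit codes are zero, the second orthofinder command picks up the precomputed BLAST results from ${WorkingDir}.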

With this, we obtained the orthogroups, which we parsed with 01_PangenomeStats.R to compute the pangenome statistics. This script requires:

  • micropan (2.1)
  • Tidyverse (2.0.0)
  • ggpubr (0.6.2)
  • rentrez (1.2.4)
  • viridis (0.6.5)
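The core of the pangenome statistic is classifying orthogroups by how many genomes they occur in. A minimal sketch of that classification, assuming the common conventions (core = present in all genomes, cloud = present in exactly one, shell = everything else; the actual 01_PangenomeStats.R script may use different cutoffs):

```python
def classify_orthogroups(presence, n_genomes):
    """Classify orthogroups from a {orthogroup: set_of_genomes} mapping.

    Assumed conventions: core = all genomes, cloud = one genome,
    shell = everything in between.
    """
    counts = {"core": 0, "shell": 0, "cloud": 0}
    for genomes in presence.values():
        n = len(genomes)
        if n == n_genomes:
            counts["core"] += 1
        elif n == 1:
            counts["cloud"] += 1
        else:
            counts["shell"] += 1
    return counts
```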

To obtain the mOTUpan pangenome estimates, we ran:

mOTUpan.py --faas ${fasta_dir}/*.faa -o $ofile

To obtain the KO assignments, we ran KoFamScan, then filtered the results and collapsed them per orthogroup following the script from Bernabeu et al. (2025).
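The per-orthogroup collapse can be sketched as a majority vote over the KO labels of an orthogroup's member genes. Note the actual rule follows the script from Bernabeu et al. (2025); majority vote is only an assumed convention for illustration.

```python
from collections import Counter

def collapse_kos(gene_to_ko, og_to_genes):
    """Assign each orthogroup the most frequent KO among its member genes.

    Majority vote is an assumption here, not necessarily the rule used
    in the published script. Orthogroups with no annotated genes are skipped.
    """
    og_ko = {}
    for og, genes in og_to_genes.items():
        kos = [gene_to_ko[g] for g in genes if g in gene_to_ko]
        if kos:
            og_ko[og] = Counter(kos).most_common(1)[0][0]
    return og_ko
```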

DTL

Reconciliation

We then selected orthogroups with more than 4 sequences and at least one Asgard or Thermoproteota sequence with select_orthogroups.R. Requirements:

  • Tidyverse (2.0.0)

We reconstructed the gene trees for these orthogroups using the workflow from Bernabeu et al. (2025). AleRax then additionally filters these, keeping only gene trees with at least 4 species. We ran AleRax as follows:

srun -n 56 alerax \
-f ${input} \
-s ${tree}  \
-p ${output} \
--gene-tree-samples 100 --model-parametrization PER-FAMILY 

We added --rec-model UNDATED-DL when necessary, for both topologies.

Species-overlap (phylome)

We generated the phylomes and parsed the gene trees using the pipeline from PhylomeDB. See the manuscript for references.

Parse results

We parsed the results with these scripts:

  • 00_Prepare_data.R : Parses AleRax output and prepares the output tables that are the base for Figure 2
  • 01_Phylome_Data.R : Parses the Phylome data and prepares the output tables
  • 02_Tandem_Duplications.R : Analyzes tandem duplications
  • 03_Plot.R : Generates Figure 2 from the tables generated from previous scripts
  • 04_Aux_Figures.R : Generates auxiliary figures

Library requirements:

  • phylotools 0.2.2
  • phytools 2.5-2
  • tidyverse 2.0.0
  • ggtree 3.16.3
  • ggpubr 0.6.2
  • gggenes 0.5.1
  • groupcompare 1.0.1
  • ggridges 0.5.7
  • cowplot 1.2.0

INTER-DOMAIN HORIZONTAL GENE TRANSFER

To assess HGT candidates, we first ran HGTector with the following commands:

To generate the BLAST search:

blastp -query ${genome}.faa -db ${broaddb} -outfmt "6 qaccver saccver pident evalue bitscore qcovhsp staxids" -num_threads 112 > ${genome}.out

These BLAST results are then parsed with:

hgtector search -i ${genome}.faa -m precomp -s ${genome}.out -o ${genome} -t ${taxdump}
hgtector analyze -i ${genome}.tsv -o ${genome} -t ${taxdump} --self-tax 25 --close-tax 3

This yields a list of proteins flagged as HGT candidates. We then query the BLAST results to retrieve homologs with hgtprep.py, a Python script that takes the following arguments:

  • -i : input list of putative HGT gene IDs as obtained from HGTector
  • -b : BLAST results from HGTector
  • -s : taxID/Accession from seed organism to label sequence
  • -f : FASTA file from seed organism
  • -d : BLAST database (to retrieve the sequences)
  • -o : output directory
  • -n : number of sequences to be retrieved (default 250)

and requires the following packages:

  • biopython
  • pandas
  • sys, argparse, os
  • subprocess

We then reconstructed the gene trees with the pipeline from Bernabeu et al. (2024). To root the trees with MAD, we used the function tree.mod.root_on_minimal_ancestor_deviation() from the package toytree.
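The homolog-retrieval step can be sketched as follows. This is a hypothetical, minimal version of hgtprep.py's core logic, not the actual script: for each putative HGT gene (-i), keep the accessions of its best n hits (-n) from HGTector's tabular BLAST output (-b), assuming the column order used in the blastp command above.

```python
import csv

def top_hits(blast_tsv, candidate_ids, n=250):
    """For each putative HGT gene, keep the accessions of its best n hits.

    Assumes the tabular format produced above (qaccver saccver pident
    evalue bitscore qcovhsp staxids), with hits already ordered per query
    as BLAST emits them. Hypothetical sketch, not the actual hgtprep.py.
    """
    hits = {q: [] for q in candidate_ids}
    with open(blast_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, subject = row[0], row[1]
            if query in hits and len(hits[query]) < n:
                hits[query].append(subject)
    return hits
```

The returned accessions would then be fetched from the BLAST database (-d) to build the FASTA input for gene-tree reconstruction.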

The gene trees are then analyzed both by Abaccus and by our in-house script HGT_Detector.py, another Python script that takes the following arguments:

  • -p : Directory where the gene trees are stored
  • -i : Input lineage (accession)
  • -r : rooting (midpoint or MAD)
  • -t : Taxonomy file, a ";"-separated table
  • -o : output directory

and requires the following packages:

  • ete3
  • re
  • functools
  • argparse
  • glob
  • pandas
  • multiprocessing
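The ";"-separated taxonomy table passed to -t could be read as sketched below. The column layout (accession first, then lineage ranks from domain downwards) is an assumption about the expected format, not taken from the script itself.

```python
def read_taxonomy(path):
    """Read a ';'-separated taxonomy table into {accession: [ranks...]}.

    Assumed layout: first field is the genome accession, remaining fields
    are lineage ranks; hypothetical sketch of the format HGT_Detector.py's
    -t argument expects.
    """
    taxonomy = {}
    with open(path) as fh:
        for line in fh:
            fields = line.strip().split(";")
            if len(fields) > 1:
                taxonomy[fields[0]] = fields[1:]
    return taxonomy
```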

Trees marked as eukaryote-saturated require an additional BLAST search against BROAD-DB restricted to prokaryotes, followed by filtering of the eukaryotic sequences with rerun_neuks.R.

This generates a table that is further refined with 00_Parse_Data.R, which adds GFF and functional annotation information. It is an R script that takes the following arguments:

  • -l : name of the lineage (accession)
  • -d : path to data directory
  • -o : output directory
  • -t : topology (old/alt)

and uses the following libraries:

  • tidyverse
  • ape
  • phylotools
  • argparse

The figures related to inter-domain HGT are then generated with 01_Analysis_Figs.R, and any additional tests are run with 02_Additional_tests.R.

These scripts require the following libraries:

  • ape
  • tidyverse
  • ggtree
  • ggtreeExtra
  • viridis
  • aplot
  • topGO
  • clusterProfiler
  • ggpubr
  • igraph
  • gggenes

Trees with eukaryotic sequences are further inspected with 03_Euk_trees.R.
