We selected the representative genomes from GTDB r226 with select_proteomes.R.
Requirements:
- Tidyverse (2.0.0)
- rentrez (1.2.4)
- phylotools (0.2.2)
This script also contains the commands to download the genomes via NCBI datasets.
We checked their translation tables with check_translation_codes.py.
Requirements:
- urllib.request
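The check boils down to confirming which genetic code each genome uses. A minimal Python sketch (a hypothetical helper, not the actual check_translation_codes.py) that counts `transl_table` tags in GFF/GenBank-style feature text:

```python
import re
from collections import Counter

def translation_tables(feature_text: str) -> Counter:
    """Count the transl_table codes found in GFF/GenBank feature text."""
    return Counter(int(m) for m in re.findall(r"transl_table=(\d+)", feature_text))

# Most bacterial/archaeal CDS features carry transl_table=11
text = "CDS ... transl_table=11\nCDS ... transl_table=11\nCDS ... transl_table=4\n"
print(translation_tables(text))  # Counter({11: 2, 4: 1})
```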
We then performed genome annotation with Prokka using the following command:
prokka --kingdom $kingdom --outdir ${output} --metagenome ${input}
Once we obtained the proteomes, we ran OrthoFinder with the following commands:
orthofinder -op -S blast -f ${input_folder}/ | grep "blastp" | grep -v "Test" | sed 's/-query/-num_threads 4 -query/' > blast_cmds.txt
# Run BLAST commands in parallel, once finished:
orthofinder -b ${WorkingDir} -og -a 112
With this, we obtained the orthogroups, which we parsed with 01_PangenomeStats.R
to compute the pangenome statistics. This script requires:
- micropan (2.1)
- Tidyverse (2.0.0)
- ggpubr (0.6.2)
- rentrez (1.2.4)
- viridis (0.6.5)
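The basic computation behind such pangenome statistics can be sketched as classifying orthogroups into core/shell/cloud from a gene presence/absence matrix. The thresholds below are illustrative only; micropan and 01_PangenomeStats.R may use different cutoffs:

```python
def classify_orthogroups(presence, n_genomes, core_frac=0.95, cloud_frac=0.25):
    """Classify orthogroups by the fraction of genomes that contain them.

    presence: dict mapping orthogroup ID -> set of genome IDs in which it occurs.
    """
    classes = {}
    for og, genomes in presence.items():
        frac = len(genomes) / n_genomes
        if frac >= core_frac:
            classes[og] = "core"      # present in (nearly) all genomes
        elif frac <= cloud_frac:
            classes[og] = "cloud"     # rare, genome-specific
        else:
            classes[og] = "shell"     # intermediate frequency
    return classes

presence = {"OG1": {"g1", "g2", "g3", "g4"}, "OG2": {"g1"}, "OG3": {"g1", "g2"}}
print(classify_orthogroups(presence, n_genomes=4))
# {'OG1': 'core', 'OG2': 'cloud', 'OG3': 'shell'}
```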
To obtain the mOTUpan pangenome estimates, we ran:
mOTUpan.py --faas ${fasta_dir}/*.faa -o $ofile
To obtain the KO assignments, we ran KoFamScan, filtered the results, and collapsed them per orthogroup following the script from Bernabeu et al. 2025.
We then selected orthogroups with more than 4 sequences and at least one Asgard/Thermoproteota sequence using select_orthogroups.R.
Requirements:
- Tidyverse (2.0.0)
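The selection logic can be sketched in Python as follows (a hypothetical equivalent of select_orthogroups.R; the phylum labels and the genome-to-phylum mapping are assumptions, not part of the original script):

```python
# Assumed GTDB-style phylum labels for the taxa of interest
TARGET_PHYLA = {"Asgardarchaeota", "Thermoproteota"}

def select_orthogroups(og_members, genome_phylum, min_seqs=5):
    """Keep orthogroups with >4 sequences and >=1 Asgard/Thermoproteota member.

    og_members: dict orthogroup -> list of (genome_id, gene_id) pairs
    genome_phylum: dict genome_id -> phylum name
    """
    keep = []
    for og, members in og_members.items():
        if len(members) < min_seqs:          # "more than 4 sequences"
            continue
        if any(genome_phylum[g] in TARGET_PHYLA for g, _ in members):
            keep.append(og)
    return keep

og = {"OG1": [("a", "x1"), ("a", "x2"), ("b", "x3"), ("c", "x4"), ("d", "x5")],
      "OG2": [("a", "y1"), ("b", "y2")]}
phy = {"a": "Asgardarchaeota", "b": "Pseudomonadota",
       "c": "Bacillota", "d": "Thermoproteota"}
print(select_orthogroups(og, phy))  # ['OG1']
```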
We reconstructed the gene trees for these orthogroups using the workflow from Bernabeu et al. (2025). AleRax then applies an additional filter, keeping only gene trees with at least 4 species. We ran AleRax as follows:
srun -n 56 alerax \
-f ${input} \
-s ${tree} \
-p ${output} \
--gene-tree-samples 100 --model-parametrization PER-FAMILY
We added --rec-model UNDATED-DL when necessary, and ran this for both topologies.
We generated the phylomes and parsed the gene trees using the pipeline from PhylomeDB. See the manuscript for references.
We parsed the results with these scripts:
- 00_Prepare_data.R: Parses the AleRax output and prepares the output tables that are the base for Figure 2
- 01_Phylome_Data.R: Parses the phylome data and prepares the output tables
- 02_Tandem_Duplications.R: Analyzes tandem duplications
- 03_Plot.R: Generates Figure 2 from the tables generated by the previous scripts
- 04_Aux_Figures.R: Generates auxiliary figures
Library requirements:
- phylotools 0.2.2
- phytools 2.5-2
- tidyverse 2.0.0
- ggtree 3.16.3
- ggpubr 0.6.2
- gggenes 0.5.1
- groupcompare 1.0.1
- ggridges 0.5.7
- cowplot 1.2.0
To assess HGT candidates, we first ran HGTector with the following commands:
To generate the BLAST search:
blastp -query ${genome}.faa -db ${broaddb} -outfmt "6 qaccver saccver pident evalue bitscore qcovhsp staxids" -num_threads 112 > ${genome}.out
These BLAST results are then parsed with:
hgtector search -i ${genome}.faa -m precomp -s ${genome}.out -o ${genome} -t ${taxdump}
hgtector analyze -i ${genome}.tsv -o ${genome} -t ${taxdump} --self-tax 25 --close-tax 3
This yields a list of proteins flagged as HGT candidates. We then query the BLAST results to retrieve homologs with hgtprep.py.
This is a python script that takes the following arguments:
- -i : input list of putative HGT gene IDs as obtained from HGTector
- -b : BLAST results from HGTector
- -s : taxID/Accession from seed organism to label sequence
- -f : FASTA file from seed organism
- -d : BLAST database (to retrieve the sequences)
- -o : output directory
- -n : number of sequences to be retrieved (default 250)
and requires the following packages:
- biopython
- pandas
- sys, argparse, os
- subprocess
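The command-line interface described above can be sketched with argparse (a skeleton written for this document, not the actual hgtprep.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argparse skeleton mirroring the hgtprep.py arguments listed above."""
    p = argparse.ArgumentParser(description="Retrieve homologs for HGT candidates")
    p.add_argument("-i", required=True, help="list of putative HGT gene IDs (from HGTector)")
    p.add_argument("-b", required=True, help="BLAST results from HGTector")
    p.add_argument("-s", required=True, help="taxID/accession of the seed organism")
    p.add_argument("-f", required=True, help="FASTA file of the seed organism")
    p.add_argument("-d", required=True, help="BLAST database to retrieve sequences from")
    p.add_argument("-o", required=True, help="output directory")
    p.add_argument("-n", type=int, default=250, help="number of sequences to retrieve")
    return p

args = build_parser().parse_args(["-i", "ids.txt", "-b", "hits.out", "-s", "GCA_000001",
                                  "-f", "seed.faa", "-d", "broaddb", "-o", "out"])
print(args.n)  # 250
```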
We then reconstructed the gene trees with the pipeline from Bernabeu et al. (2024). To root the trees with MAD, we used the function
tree.mod.root_on_minimal_ancestor_deviation() from the package toytree.
The gene trees are then analyzed both by Abaccus and by our in-house script HGT_Detector.py.
This is another python script that takes the following arguments:
- -p : Directory where the gene trees are stored
- -i : Input lineage (accession)
- -r : rooting (midpoint or MAD)
- -t : Taxonomy file, a ";"-separated table
- -o : output directory
and requires the following packages:
- ete3
- re
- functools
- argparse
- glob
- pandas
- multiprocessing
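The ";"-separated taxonomy file can be loaded into a lineage lookup along these lines (a sketch assuming the first field is the accession followed by ranked taxa; the actual column layout of HGT_Detector.py's input may differ):

```python
def load_taxonomy(lines):
    """Map accession -> list of ranked taxa from a ';'-separated table."""
    tax = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split(";")]
        if len(fields) > 1:
            tax[fields[0]] = fields[1:]
    return tax

rows = ["GCA_000001;Archaea;Asgardarchaeota;Lokiarchaeia",
        "GCA_000002;Bacteria;Pseudomonadota;Gammaproteobacteria"]
tax = load_taxonomy(rows)
print(tax["GCA_000001"][0])  # Archaea
```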
Trees that are marked as eukaryote-saturated require an additional BLAST search against BROAD-DB
restricted to prokaryotes, followed by filtering of the eukaryotic sequences with rerun_neuks.R.
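The filtering step amounts to dropping FASTA records whose IDs belong to eukaryotes. A Python sketch of that logic (rerun_neuks.R itself is in R; the example IDs and headers are invented):

```python
def drop_sequences(fasta_text, exclude_ids):
    """Remove FASTA records whose header ID is in exclude_ids."""
    kept, keep_current = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            seq_id = line[1:].split()[0]      # ID = first token of the header
            keep_current = seq_id not in exclude_ids
        if keep_current:
            kept.append(line)
    return "\n".join(kept)

fasta = ">prok1 desc\nMKV\n>euk1 desc\nMTT\n>prok2\nMGG"
print(drop_sequences(fasta, {"euk1"}))
# >prok1 desc
# MKV
# >prok2
# MGG
```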
This generates a table that is further refined with 00_Parse_Data.R, which additionally adds GFF and functional annotation information.
It is an R script that takes the following arguments:
- -l : name of the lineage (accession)
- -d : path to data directory
- -o : output directory
- -t : topology (old/alt)
and uses the following libraries:
- tidyverse
- ape
- phylotools
- argparse
The figures related to inter-domain HGT are then generated with 01_Analysis_Figs.R, and any additional tests are run with 02_Additional_tests.R.
These scripts require the following libraries:
- ape
- tidyverse
- ggtree
- ggtreeExtra
- viridis
- aplot
- topGO
- clusterProfiler
- ggpubr
- igraph
- gggenes
Trees with eukaryotic sequences are further inspected with 03_Euk_trees.R.