We selected the representative genomes from GTDB r226 with select_proteomes.R.
Requirements:
- Tidyverse (2.0.0)
- rentrez (1.2.4)
- phylotools (0.2.2)
This script also contains the commands to download the genomes via NCBI datasets.
We checked their translation tables with check_translation_codes.py.
Requirements:
- urllib.request
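The check boils down to confirming which genetic code each genome uses. A minimal Python sketch (a hypothetical helper, not the actual check_translation_codes.py) that counts `transl_table` tags in GFF/GenBank-style feature text:

```python
import re
from collections import Counter

def translation_tables(feature_text: str) -> Counter:
    """Count the transl_table codes found in GFF/GenBank feature text."""
    return Counter(int(m) for m in re.findall(r"transl_table=(\d+)", feature_text))

# Most bacterial/archaeal CDS features carry transl_table=11
text = "CDS ... transl_table=11\nCDS ... transl_table=11\nCDS ... transl_table=4\n"
print(translation_tables(text))  # Counter({11: 2, 4: 1})
```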
We then performed genome annotation with Prokka using the following command:
prokka --kingdom $kingdom --outdir ${output} --metagenome ${input}
Once we obtained the proteomes, we ran OrthoFinder with the following commands:
orthofinder -op -S blast -f ${input_folder}/ | grep "blastp" | grep -v "Test" | sed 's/-query/-num_threads 4 -query/' > blast_cmds.txt
# Run BLAST commands in parallel, once finished:
orthofinder -b ${WorkingDir} -og -a 112
With this, we obtained the orthogroups, which we parsed with 01_PangenomeStats.R
to compute the pangenome statistics. This script requires:
- micropan (2.1)
- Tidyverse (2.0.0)
- ggpubr (0.6.2)
- rentrez (1.2.4)
- viridis (0.6.5)
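The basic computation behind such pangenome statistics can be sketched as classifying orthogroups into core/shell/cloud from a gene presence/absence matrix. The thresholds below are illustrative only; micropan and 01_PangenomeStats.R may use different cutoffs:

```python
def classify_orthogroups(presence, n_genomes, core_frac=0.95, cloud_frac=0.25):
    """Classify orthogroups by the fraction of genomes that contain them.

    presence: dict mapping orthogroup ID -> set of genome IDs in which it occurs.
    """
    classes = {}
    for og, genomes in presence.items():
        frac = len(genomes) / n_genomes
        if frac >= core_frac:
            classes[og] = "core"      # present in (nearly) all genomes
        elif frac <= cloud_frac:
            classes[og] = "cloud"     # rare, genome-specific
        else:
            classes[og] = "shell"     # intermediate frequency
    return classes

presence = {"OG1": {"g1", "g2", "g3", "g4"}, "OG2": {"g1"}, "OG3": {"g1", "g2"}}
print(classify_orthogroups(presence, n_genomes=4))
# {'OG1': 'core', 'OG2': 'cloud', 'OG3': 'shell'}
```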
To obtain the mOTUpan pangenome estimates, we ran:
mOTUpan.py --faas ${fasta_dir}/*.faa -o $ofile
To obtain the KO assignments, we ran KoFamScan, filtered the results, and collapsed them per orthogroup following the script from Bernabeu et al. 2025.
We then selected orthogroups with more than 4 sequences and at least one Asgard/Thermoproteota sequence using select_orthogroups.R.
Requirements:
- Tidyverse (2.0.0)
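The selection logic can be sketched in Python as follows (a hypothetical equivalent of select_orthogroups.R; the phylum labels and the genome-to-phylum mapping are assumptions, not part of the original script):

```python
# Assumed GTDB-style phylum labels for the taxa of interest
TARGET_PHYLA = {"Asgardarchaeota", "Thermoproteota"}

def select_orthogroups(og_members, genome_phylum, min_seqs=5):
    """Keep orthogroups with >4 sequences and >=1 Asgard/Thermoproteota member.

    og_members: dict orthogroup -> list of (genome_id, gene_id) pairs
    genome_phylum: dict genome_id -> phylum name
    """
    keep = []
    for og, members in og_members.items():
        if len(members) < min_seqs:          # "more than 4 sequences"
            continue
        if any(genome_phylum[g] in TARGET_PHYLA for g, _ in members):
            keep.append(og)
    return keep

og = {"OG1": [("a", "x1"), ("a", "x2"), ("b", "x3"), ("c", "x4"), ("d", "x5")],
      "OG2": [("a", "y1"), ("b", "y2")]}
phy = {"a": "Asgardarchaeota", "b": "Pseudomonadota",
       "c": "Bacillota", "d": "Thermoproteota"}
print(select_orthogroups(og, phy))  # ['OG1']
```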
We reconstructed the gene trees for these orthogroups using the workflow from Bernabeu et al. (2025). AleRax then applies an additional filter, keeping only gene trees with at least 4 species. We ran AleRax as follows:
srun -n 56 alerax \
-f ${input} \
-s ${tree} \
-p ${output} \
--gene-tree-samples 100 --model-parametrization PER-FAMILY
We added --rec-model UNDATED-DL when necessary, and ran this for both topologies.
We generated the phylomes and parsed the gene trees using the pipeline from PhylomeDB. See the manuscript for references.
We parsed the results with these scripts:
- 00_Prepare_data.R: Parses the AleRax output and prepares the output tables that are the base for Figure 2
- 01_Phylome_Data.R: Parses the phylome data and prepares the output tables
- 02_Tandem_Duplications.R: Analyzes tandem duplications
- 03_Plot.R: Generates Figure 2 from the tables generated by the previous scripts
- 04_Aux_Figures.R: Generates auxiliary figures
Library requirements:
- phylotools 0.2.2
- phytools 2.5-2
- tidyverse 2.0.0
- ggtree 3.16.3
- ggpubr 0.6.2
- gggenes 0.5.1
- groupcompare 1.0.1
- ggridges 0.5.7
- cowplot 1.2.0
To assess HGT candidates, we first ran HGTector with the following commands:
To generate the BLAST search:
blastp -query ${genome}.faa -db ${broaddb} -outfmt "6 qaccver saccver pident evalue bitscore qcovhsp staxids" -num_threads 112 > ${genome}.out
These BLAST results are then parsed with:
hgtector search -i ${genome}.faa -m precomp -s ${genome}.out -o ${genome} -t ${taxdump}
hgtector analyze -i ${genome}.tsv -o ${genome} -t ${taxdump} --self-tax 25 --close-tax 3
This yields a list of proteins flagged as HGT candidates. We then query the BLAST results to retrieve homologs with hgtprep.py.
This is a python script that takes the following arguments:
- -i : input list of putative HGT gene IDs as obtained from HGTector
- -b : BLAST results from HGTector
- -s : taxID/Accession from seed organism to label sequence
- -f : FASTA file from seed organism
- -d : BLAST database (to retrieve the sequences)
- -o : output directory
- -n : number of sequences to be retrieved (default 250)
and requires the following packages:
- biopython
- pandas
- sys, argparse, os
- subprocess
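The command-line interface described above can be sketched with argparse (a skeleton written for this document, not the actual hgtprep.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argparse skeleton mirroring the hgtprep.py arguments listed above."""
    p = argparse.ArgumentParser(description="Retrieve homologs for HGT candidates")
    p.add_argument("-i", required=True, help="list of putative HGT gene IDs (from HGTector)")
    p.add_argument("-b", required=True, help="BLAST results from HGTector")
    p.add_argument("-s", required=True, help="taxID/accession of the seed organism")
    p.add_argument("-f", required=True, help="FASTA file of the seed organism")
    p.add_argument("-d", required=True, help="BLAST database to retrieve sequences from")
    p.add_argument("-o", required=True, help="output directory")
    p.add_argument("-n", type=int, default=250, help="number of sequences to retrieve")
    return p

args = build_parser().parse_args(["-i", "ids.txt", "-b", "hits.out", "-s", "GCA_000001",
                                  "-f", "seed.faa", "-d", "broaddb", "-o", "out"])
print(args.n)  # 250
```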
We then reconstructed the gene trees with the pipeline from Bernabeu et al. (2024). To root the trees with MAD, we used the function
tree.mod.root_on_minimal_ancestor_deviation() from the package toytree.
The gene trees are then analyzed both by Abaccus and by our in-house script HGT_Detector.py.
This is another python script that takes the following arguments:
- -p : Directory where the gene trees are stored
- -i : Input lineage (accession)
- -r : rooting (midpoint or MAD)
- -t : Taxonomy file, a ";"-separated table
- -o : output directory
and requires the following packages:
- ete3
- re
- functools
- argparse
- glob
- pandas
- multiprocessing
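The ";"-separated taxonomy file can be loaded into a lineage lookup along these lines (a sketch assuming the first field is the accession followed by ranked taxa; the actual column layout of HGT_Detector.py's input may differ):

```python
def load_taxonomy(lines):
    """Map accession -> list of ranked taxa from a ';'-separated table."""
    tax = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split(";")]
        if len(fields) > 1:
            tax[fields[0]] = fields[1:]
    return tax

rows = ["GCA_000001;Archaea;Asgardarchaeota;Lokiarchaeia",
        "GCA_000002;Bacteria;Pseudomonadota;Gammaproteobacteria"]
tax = load_taxonomy(rows)
print(tax["GCA_000001"][0])  # Archaea
```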
Trees that are marked as eukaryote-saturated require an additional BLAST search against BROAD-DB
restricted to prokaryotes, followed by filtering of the eukaryotic sequences with rerun_neuks.R.
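The filtering step amounts to dropping FASTA records whose IDs belong to eukaryotes. A Python sketch of that logic (rerun_neuks.R itself is in R; the example IDs and headers are invented):

```python
def drop_sequences(fasta_text, exclude_ids):
    """Remove FASTA records whose header ID is in exclude_ids."""
    kept, keep_current = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            seq_id = line[1:].split()[0]      # ID = first token of the header
            keep_current = seq_id not in exclude_ids
        if keep_current:
            kept.append(line)
    return "\n".join(kept)

fasta = ">prok1 desc\nMKV\n>euk1 desc\nMTT\n>prok2\nMGG"
print(drop_sequences(fasta, {"euk1"}))
# >prok1 desc
# MKV
# >prok2
# MGG
```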
This generates a table that is further refined with 00_Parse_Data.R, which additionally adds GFF and functional annotation information.
It is an R script that takes the following arguments:
- -l : name of the lineage (accession)
- -d : path to data directory
- -o : output directory
- -t : topology (old/alt)
and uses the following libraries:
- tidyverse
- ape
- phylotools
- argparse
The figures related to inter-domain HGT are then generated with 01_Analysis_Figs.R, and any additional tests are run with 02_Additional_tests.R.
These scripts require the following libraries:
- ape
- tidyverse
- ggtree
- ggtreeExtra
- viridis
- aplot
- topGO
- clusterProfiler
- ggpubr
- igraph
- gggenes
Trees with eukaryotic sequences are further inspected with 03_Euk_trees.R.