Arcadia-Science/noveltree is a Nextflow pipeline for phylogenomic inference from whole-proteome amino acid data — automating orthology inference, multiple sequence alignment, gene-family and species tree estimation, and reconciliation-based evolutionary analysis. Input proteomes can be preprocessed using the built-in `--preprocess` flag or filtered externally (see preprocessing scripts).
NovelTree is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a highly portable manner. It uses Docker containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
Detailed documentation: For thorough descriptions of samplesheet preparation, all parameters, per-module options, and output files, see `docs/usage.md` and `docs/outputs.md`. This README provides a concise overview to get started quickly.
NOTE: Unfortunately, at this time NovelTree is not compatible with Apple silicon/ARM architectures (e.g. M1, M2 chips).
1. Install Nextflow (>=21.10.3).
2. Install Docker.
3. Run the pipeline with the minimal test dataset:
```bash
nextflow run . -profile docker,test --outdir results
```

To constrain resource usage (e.g. on a laptop), specify limits:

```bash
nextflow run . -profile docker,test --outdir results --max_cpus 12 --max_memory 16GB
```

Set `--max_memory` ~2 GB below your available memory to leave room for Nextflow overhead.
Note: Pre-built Docker images are pulled automatically. You only need `make docker-all` if you've modified the pipeline code.
NOTE: The workflow supports both Docker and Singularity profiles.
NovelTree takes a CSV samplesheet as input. Only 3 columns are required:

```csv
species,input_data,input_type
Homo-sapiens,UP000005640,proteins
Mus-musculus,GCF_000001635.27,proteins
Drosophila-melanogaster,/path/to/Dmel.fasta,proteins
Saccharomyces-cerevisiae,https://example.com/Scer.fasta.gz,proteins
```

| Column | Description |
|---|---|
| `species` | Species name in `Genus-species` format |
| `input_data` | Local file path, URL, UniProt proteome ID (`UP*`), or NCBI accession (`GCF_*`/`GCA_*`) |
| `input_type` | `proteins` or `transcriptome` |
Optional columns (`has_uniprot_ids`, `transdecoder`, `filter_isoforms`, `reference_proteome`, `include_in_mcl_test`, `busco_shallow`, `busco_broad`) can be added in any order after the required 3. All default to `no` or `NA`. See the full samplesheet documentation for details on all columns, data source types, and preprocessing options.
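As a rough sketch of what samplesheet handling involves, the snippet below checks the three required columns and classifies each `input_data` value by source type. This is illustrative only — the function names are hypothetical, and the pipeline's real validation lives in its input-check module:

```python
import csv
import io
import re

REQUIRED = ("species", "input_data", "input_type")

def classify_source(value: str) -> str:
    """Guess where a samplesheet's input_data points (illustrative only)."""
    if re.match(r"^UP\d+$", value):
        return "uniprot"   # UniProt proteome ID, e.g. UP000005640
    if re.match(r"^GC[FA]_\d+", value):
        return "ncbi"      # NCBI assembly accession, GCF_*/GCA_*
    if value.startswith(("http://", "https://", "s3://")):
        return "url"       # remote file to download
    return "local"         # otherwise assume a local file path

def check_samplesheet(text: str) -> list[dict]:
    """Validate required columns and annotate each row with a source type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        for col in REQUIRED:
            if not row.get(col):
                raise ValueError(f"missing required column: {col}")
        if row["input_type"] not in ("proteins", "transcriptome"):
            raise ValueError(f"bad input_type: {row['input_type']}")
        if not re.match(r"^[A-Z][a-z]+-[a-z]+", row["species"]):
            raise ValueError(f"species not in Genus-species format: {row['species']}")
        row["source"] = classify_source(row["input_data"])
    return rows

sheet = """species,input_data,input_type
Homo-sapiens,UP000005640,proteins
Drosophila-melanogaster,/path/to/Dmel.fasta,proteins
"""
rows = check_samplesheet(sheet)
```

A quick pre-flight check like this can catch formatting mistakes before committing compute time to a full pipeline run.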
NovelTree supports three workflow modes to accommodate different use cases and computational constraints:
| Feature | Full | Simplified | Zoogle |
|---|---|---|---|
| BUSCO quality assessment | ✓ | ✗ | ✗ |
| Default aligner | Adaptive | Adaptive | Adaptive |
| Per-family GeneRax | ✓ | ✗ | ✗ |
| Per-species GeneRax | ✓ | ✓ | ✓ |
| GeneRax strategy | SPR | EVAL | EVAL |
| Phylogenetic profiles | ✓ | ✓ | ✓ |
| Physicochemical properties | ✗ | ✗ | ✓ |
| Time-calibrated species tree | ✗ | ✗ | ✓ |
| Phylo-dist analysis | ✗ | ✗ | ✓ |
Adaptive mode routes families through MAFFT (≤200 seqs), WITCH (≤3000), and FAMSA (>3000).
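The size thresholds above amount to a tiny routing function — sketched here for clarity (`pick_aligner` is a hypothetical name, not a pipeline function):

```python
def pick_aligner(n_seqs: int) -> str:
    """Route a gene family to an aligner by size (thresholds from the docs)."""
    if n_seqs <= 200:
        return "mafft"   # accurate alignment for small families
    if n_seqs <= 3000:
        return "witch"   # ensemble alignment for mid-size families
    return "famsa"       # fast, scalable alignment for very large families
```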
Which mode should I use? Use simplified mode (the default) for most analyses. Use full for smaller datasets (≤30 species) where you want additional analyses (BUSCO, per-family GeneRax). Use zoogle when you need physicochemical distance analysis for organism prioritization.
The complete pipeline with all optional analyses enabled. Best for comprehensive phylogenomic studies where accuracy is prioritized over speed.

```bash
nextflow run . -profile docker --input samplesheet.csv --outdir results
```

A streamlined variant optimized for large datasets. Skips BUSCO quality assessment, runs only per-species GeneRax with the faster EVAL strategy, and skips per-family GeneRax analysis.

```bash
nextflow run . -profile docker,simplified --input samplesheet.csv --outdir results
```

Inherits simplified mode settings and adds analyses for organism prioritization: physicochemical protein properties, time calibration of the species tree, and phylogenetically corrected protein distance analysis. Optionally specify a reference species for pairwise distance analysis, or use `--ref_species none` for centroid-only analysis.
Recommended (auto-build reference chronogram from TimeTree.org):
```bash
nextflow run . -profile docker,zoogle \
  --input samplesheet.csv \
  --outdir results \
  --ncbi_email user@example.com \
  --ref_species Genus-species
```

The pipeline queries TimeTree.org for pairwise divergence times among the species in your samplesheet and builds a UPGMA reference chronogram automatically.
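For intuition, UPGMA builds an ultrametric tree by repeatedly merging the closest pair of clusters at half their pairwise distance, averaging remaining distances weighted by cluster size. A minimal self-contained sketch (illustrative only — the pipeline's actual implementation and output format may differ):

```python
def upgma(names, dist):
    """Build an ultrametric Newick tree from a symmetric distance matrix
    by UPGMA (average-linkage) clustering."""
    # Active clusters: id -> (newick label, node height, number of leaves)
    clusters = {i: (name, 0.0, 1) for i, name in enumerate(names)}
    n = len(names)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(n) if i < j}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)        # closest pair of clusters (a < b)
        height = d.pop((a, b)) / 2      # ultrametric merge height
        la, ha, na = clusters.pop(a)
        lb, hb, nb = clusters.pop(b)
        # Size-weighted average distance from the new cluster to the rest.
        for k in list(clusters):
            dak = d.pop((min(a, k), max(a, k)))
            dbk = d.pop((min(b, k), max(b, k)))
            d[(k, next_id)] = (na * dak + nb * dbk) / (na + nb)
        label = f"({la}:{height - ha:g},{lb}:{height - hb:g})"
        clusters[next_id] = (label, height, na + nb)
        next_id += 1
    label, _, _ = next(iter(clusters.values()))
    return label + ";"
```

For example, `upgma(["A", "B", "C"], [[0, 2, 8], [2, 0, 8], [8, 8, 0]])` joins A and B first (at height 1), then attaches C at height 4.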
Alternative (provide your own reference tree):
```bash
nextflow run . -profile docker,zoogle \
  --input samplesheet.csv \
  --outdir results \
  --reference_time_tree /path/to/reference_timetree.newick \
  --ref_species Genus-species
```

NovelTree includes a dedicated AWS Batch profile optimized for cloud-scale analyses:
```bash
nextflow run . \
  -profile awsbatch \
  --awsqueue <your-batch-queue> \
  --awsregion <your-aws-region> \
  -work-dir s3://<your-bucket>/work \
  --outdir s3://<your-bucket>/results \
  --input s3://<your-bucket>/samplesheet.csv
```

The `awsbatch` profile includes optimized executor settings (a queue size of 1000 jobs) and automatic report overwriting for seamless pipeline resumption.
Requirements:
- AWS Batch compute environment and job queue configured
- Work directory (`-work-dir`) and output directory (`--outdir`) must be S3 paths
- Input samplesheet and proteome files accessible from S3
- Appropriate IAM permissions for Batch and S3 access
See the Nextflow Tower publication example in usage.md for cloud-scale configuration tips.
NovelTree supports Singularity as an alternative to Docker, which is useful for HPC environments where Docker may not be available:
```bash
nextflow run . -profile singularity --input samplesheet.csv --outdir results
```

Docker images are automatically pulled and converted to Singularity format. Converted images are cached in `${outdir}/singularity_cache` to avoid repeated conversions on subsequent runs.
For detailed Singularity instructions, see the Singularity documentation.
Pre-built Docker images are pulled automatically when running the pipeline. If you've modified the pipeline code or are using a custom fork, rebuild with:

```bash
make docker-all
```

Building R-based images (zoogle) may take 15-20 minutes due to package compilation. Images are built for `linux/amd64`.
The `bin/zoogle/` directory contains code vendored from the 2024-organismal-selection repository. See `bin/zoogle/README.md` for provenance details.
- Orthology inference — OrthoFinder normalizes sequence similarity scores and clusters proteins into gene families via MCL. An optional test step selects the best MCL inflation parameter using InterPro domain coherence (COGEQC).
- Alignment & trimming — Adaptive three-tier alignment (MAFFT → WITCH → FAMSA by family size), trimmed with ClipKIT.
- Tree inference — Gene family trees via IQ-TREE (FastTree fallback). Species tree via SpeciesRax (and optionally Asteroid).
- Reconciliation — GeneRax reconciles gene/species trees, estimating duplication and loss rates. Ortholog/paralog relationships and HOGs are parsed from reconciliation output.
- Phylogenetic profiles — Species × gene-family matrices of duplication, loss, and speciation events per species-tree node per gene family.
- Zoogle analyses (zoogle mode) — Physicochemical protein properties, time-calibrated trees, and phylogenetically-corrected protein distances for organism prioritization.
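The phylogenetic-profile idea — a species × gene-family matrix of per-event counts — can be sketched as a simple aggregation. The data structures and names below are hypothetical; the pipeline's actual output columns differ:

```python
from collections import defaultdict

# Hypothetical reconciliation events: (species/node, gene family, event type)
events = [
    ("Homo-sapiens", "OG0000001", "duplication"),
    ("Homo-sapiens", "OG0000001", "loss"),
    ("Mus-musculus", "OG0000001", "speciation"),
    ("Mus-musculus", "OG0000002", "duplication"),
]

def build_profile(events, event_type):
    """Count events of one type per (species/node, gene family) cell."""
    profile = defaultdict(lambda: defaultdict(int))
    for node, family, etype in events:
        if etype == event_type:
            profile[node][family] += 1
    return {node: dict(fams) for node, fams in profile.items()}

dups = build_profile(events, "duplication")
```

One such matrix per event type (duplication, loss, speciation) yields the profiles used downstream.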
The pipeline distributes tasks in a highly parallel manner across available computational resources, supporting local execution, AWS Batch, and SLURM schedulers (see Nextflow executor documentation).
```mermaid
flowchart TD
INPUT["Samplesheet + Proteomes"] --> PREP["PREPARE_INPUTS<br/>Download · Preprocess · Rename"]
PREP --> BUSCO_Q{"BUSCO?<br/>(full mode)"}
BUSCO_Q -.->|yes| BUSCO["BUSCO<br/>Shallow + Broad QC"]
PREP --> ORTHO
subgraph ORTHO["INFER_ORTHOGROUPS"]
direction LR
MCL_SEL["MCL inflation<br/>selection<br/><i>(optional)</i>"] --> OF_PREP["OrthoFinder Prep<br/>+ DIAMOND"] --> MCL["MCL Clustering<br/>+ Filtering"]
end
ORTHO -->|"conservative subset<br/>(high coverage, low copy #)"| GT1["INFER_GENE_TREES<br/>species-tree families"]
ORTHO -->|"remaining subset<br/>(≥4 species)"| GT2["INFER_GENE_TREES<br/>remaining families"]
GT1 --> RECON
GT2 --> RECON
subgraph RECON["RECONCILE_TREES"]
direction LR
AST["Asteroid<br/><i>(optional)</i>"] --> SRAX["SpeciesRax"] --> GRAX["GeneRax<br/>per-species<br/>(+ per-family<br/>in full mode)"]
end
RECON --> SUMM
subgraph SUMM["RECONCILIATION_SUMMARIES"]
direction LR
PP["Phylo Profiles"] ~~~ HOG["Parse PhyloHOGs"]
end
SUMM -.->|zoogle mode| ZOOG
RECON -.->|zoogle mode| ZOOG
ORTHO -.->|zoogle mode| ZOOG
subgraph ZOOG["ZOOGLE"]
direction LR
PHYSCHEM["Protein<br/>Properties"] ~~~ TCAL["Time<br/>Calibration"] --> DATE["Date Gene<br/>Family Trees"] --> PDIST["Phylo-dist<br/>Analysis"]
end
style BUSCO_Q fill:none,stroke:#999
style BUSCO fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
style ZOOG fill:#e8f4e8,stroke:#2d8a2d,color:#000
linkStyle 18,19 stroke:#333
```
Input Preparation
```mermaid
flowchart TD
SS["Samplesheet CSV"] --> IC["INPUT_CHECK<br/>Validate + stage"]
IC -->|remote files| DL["DOWNLOAD_INPUT<br/>S3 / URL / accession"]
IC -->|local files| MIX["All proteomes"]
DL --> MIX
MIX --> PPQ{"Preprocessing<br/>enabled?"}
PPQ -->|yes| PP["PREPROCESS_PROTEOMES<br/>TransDecoder · Isoform filter<br/>Min length · Redundancy removal"]
PPQ -->|no| RENAME
PP --> RENAME["RENAME_FASTAS<br/>Normalize species names"]
RENAME --> OUT["Renamed proteomes<br/>(ready for OrthoFinder)"]
```
Orthogroup Inference
```mermaid
flowchart TD
PROTS["Renamed Proteomes"] --> MCL_Q{"MCL testing<br/>enabled?"}
MCL_Q -->|"yes<br/>(multiple inflation values)"| ANNOT["ANNOTATE_UNIPROT<br/>InterPro domains"]
MCL_Q -->|"no<br/>(single value)"| USE_DEFAULT["Use provided<br/>inflation value"]
ANNOT --> PREP_TEST["ORTHOFINDER_PREP<br/>(test subset)"]
PREP_TEST --> BLAST_TEST["DIAMOND_BLASTP<br/>(test subset)"]
BLAST_TEST --> MCL_TEST["ORTHOFINDER_MCL<br/>(per inflation value)"]
MCL_TEST --> COGEQC["COGEQC<br/>Domain coherence scoring"]
COGEQC --> SELECT["SELECT_INFLATION<br/>Best parameter"]
SELECT --> BEST["Best inflation"]
USE_DEFAULT --> BEST
PROTS --> PREP_ALL["ORTHOFINDER_PREP<br/>(all species)"]
PREP_ALL --> BLAST_ALL["DIAMOND_BLASTP<br/>(all-vs-all)"]
BLAST_ALL --> MCL_ALL["ORTHOFINDER_MCL"]
BEST --> MCL_ALL
MCL_ALL -->|"conservative set<br/>(high coverage, low copy #)"| SPP_FAMS["Species-tree<br/>families"]
MCL_ALL -->|"remaining set<br/>(≥4 species)"| GEN_FAMS["Gene-tree<br/>families"]
```
Gene Tree Inference (runs once per subset)
```mermaid
flowchart TD
FAS["Gene Family FASTAs"] --> MODE{"Aligner?"}
MODE -->|adaptive| BRANCH{"Family size?"}
BRANCH -->|"≤200 seqs"| MAFFT["MAFFT<br/>(E-INS-i / L-INS-i)"]
BRANCH -->|"201–3000"| WITCH["WITCH"]
BRANCH -->|">3000"| FAMSA["FAMSA"]
MAFFT -.->|failure| FAMSA_FB["FAMSA<br/>(fallback)"]
WITCH -.->|failure| FAMSA_FB
MODE -->|single| SINGLE["Selected Aligner"]
MAFFT --> MSA["All MSAs"]
WITCH --> MSA
FAMSA --> MSA
FAMSA_FB --> MSA
SINGLE --> MSA
MSA --> TRIM{"Trimmer?"}
TRIM -->|clipkit| CLIPKIT["ClipKIT"]
TRIM -->|cialign| CIALIGN["CIAlign"]
TRIM -->|none| NOTRIM["No trimming"]
CLIPKIT --> CLEAN["Cleaned MSAs"]
CIALIGN --> CLEAN
NOTRIM --> CLEAN
CLEAN --> TREEQ{"Tree method?"}
TREEQ -->|iqtree| IQTREE["IQ-TREE"]
IQTREE -.->|failure| FT_FB["FastTree<br/>(fallback)"]
TREEQ -->|fasttree| FT["FastTree"]
IQTREE --> TREES["Gene Family Trees"]
FT_FB --> TREES
FT --> TREES
```
Species Tree & Reconciliation
```mermaid
flowchart TD
CORE["Core gene trees<br/>(species-tree families)"] --> OGQ{"Outgroups<br/>specified?"}
OGQ -->|yes| AST["ASTEROID<br/>Unrooted species tree"]
OGQ -->|no| SRAX
AST --> SRAX["SPECIESRAX<br/>Rooted species tree<br/>(DL model)"]
CORE --> SRAX
SRAX --> SPP["Rooted Species Tree"]
CORE --> ALL["All gene families"]
REM["Remaining gene trees"] --> ALL
SPP --> GRAX_F
ALL --> GRAX_F{"Per-family<br/>GeneRax?<br/>(full mode)"}
GRAX_F -.->|yes| PF["GENERAX_PER_FAMILY<br/>SPR strategy"]
SPP --> GRAX_S["GENERAX_PER_SPECIES<br/>SPR (full) / EVAL (simplified)"]
ALL --> GRAX_S
GRAX_S --> OUT["Reconciled trees<br/>Event counts · Species rates<br/>NHX files · Labeled species tree"]
```
Zoogle Analyses (zoogle mode only)
```mermaid
flowchart TD
OG_FAS["Original FASTAs<br/>+ Cleaned MSAs"] --> PHYSCHEM["PROTEIN_PROPERTIES<br/>AA composition · MW · pI<br/>GRAVY · Aromaticity · ..."]
REFQ{"Reference tree<br/>provided?"}
REFQ -->|no| BUILD["BUILD_REFERENCE_CHRONOGRAM<br/>TimeTree.org → UPGMA"]
REFQ -->|yes| USER["User-provided tree"]
BUILD --> REF["Reference Chronogram"]
USER --> REF
SPP["SpeciesRax<br/>species tree"] --> TCAL["TIME_CALIBRATE_SPECIES_TREE<br/>treePL penalized likelihood"]
REF --> TCAL
TCAL --> DATED_SPP["Dated species tree"]
GFT["GeneRax gene<br/>family trees"] --> DATE["DATE_GENE_FAMILY_TREES<br/>Speciation-only calibrations"]
DATED_SPP --> DATE
DATE --> DATED_GFT["Dated gene family trees"]
DATED_GFT --> ZOOG["ZOOGLE_ANALYSIS<br/>Mahalanobis distances<br/>Permutation tests"]
PHYSCHEM --> ZOOG
RELS["Ortholog / Paralog<br/>relationships"] --> ZOOG
ZOOG --> CENT["Centroid-based distances<br/>(all families)"]
ZOOG --> REFD["Reference-based distances<br/>(families with ref species)"]
```
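In spirit, the centroid-based comparison asks how far each species' protein properties sit from the family average, and the permutation test asks how surprising that distance is. The toy sketch below uses plain Euclidean distance and label shuffling; the pipeline itself uses Mahalanobis distances with phylogenetic correction, and all names and numbers here are hypothetical:

```python
import math
import random

def centroid_distance(props, species):
    """Euclidean distance from one species' property vector to the
    centroid of all species in the family (toy stand-in for the
    Mahalanobis distance the pipeline uses)."""
    dims = len(next(iter(props.values())))
    centroid = [sum(v[i] for v in props.values()) / len(props) for i in range(dims)]
    return math.dist(props[species], centroid)

def permutation_pvalue(props, species, n_perm=1000, seed=0):
    """Fraction of label-shuffled datasets whose distance is at least as
    large as the observed one (one-sided permutation test)."""
    rng = random.Random(seed)
    observed = centroid_distance(props, species)
    names, vectors = list(props), list(props.values())
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(vectors)
        if centroid_distance(dict(zip(names, vectors)), species) >= observed:
            hits += 1
    return hits / n_perm

props = {  # hypothetical per-species property vectors (e.g. pI, GRAVY)
    "Homo-sapiens": [1.0, 0.0],
    "Mus-musculus": [1.1, 0.1],
    "Outlier-species": [5.0, 3.0],
}
```

Here `Outlier-species` sits far from the centroid and gets a smaller permutation p-value than the two similar species, which is the kind of signal the zoogle analyses use for prioritization.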
NovelTree was originally written by Arcadia Science.
We try to be as open as possible with our work and make all of our code both available and usable. We love receiving feedback at any level, through comments on our pubs or Twitter and issues or pull requests here on GitHub. In turn, we routinely provide public feedback on other people’s work by commenting on preprints, filing issues on repositories when we encounter bugs, and contributing to open-source projects through pull requests and code review.
Anyone is welcome to contribute to our code. When we publish new versions of pubs, we include a link to the "Contributions" page for the relevant GitHub repo in the Acknowledgements/Contributors section. If someone’s contribution has a substantial impact on our scientific direction, the biological result of a project, or the functionality of our code, the pub’s point person may add that person as a formal contributor to the pub with "Critical Feedback" specified as their role.
Our policy is that external contributors cannot be byline-level authors on pubs, simply because we need to ensure that our byline authors are accountable for the quality and integrity of our work, and we must be able to enforce quick turnaround times for internal pub review. We apply this same policy to feedback on the text and other non-code content in pubs.
If you make a substantial contribution, you are welcome to publish it or use it in your own work (in accordance with the license — our pubs are CC BY 4.0 and our code is openly licensed). We encourage anyone to build upon our efforts.
If you use Arcadia-Science/noveltree for your analysis, please cite it using the following DOI: 10.57844/arcadia-z08x-v798
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.