From ec26cfc21c84692473946bee4719e9902d44ebf8 Mon Sep 17 00:00:00 2001 From: "James A. Fellows Yates" Date: Thu, 23 Apr 2026 13:28:48 +0200 Subject: [PATCH 1/7] Simplify summary, Britishify language --- paper/paper.md | 102 ++++++++++++++++++++++++------------------------- 1 file changed, 51 insertions(+), 51 deletions(-) diff --git a/paper/paper.md b/paper/paper.md index c1e919ae..8bc65633 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -1,8 +1,10 @@ --- -title: 'nf-core/funcscan: A Nextflow pipeline to identify the biosynthetic potential and resistome of bacterial (meta)genomes' +title: "nf-core/funcscan: A Nextflow pipeline to identify the biosynthetic potential and resistome of bacterial (meta)genomes" tags: - nf-core - nextflow + - pipeline + - bioinformatics - AMP - AMR - antibiotic-resistance @@ -36,16 +38,16 @@ authors: orcid: 0009-0002-6815-8608 affiliation: 4 - name: Haidong Yi - - orcid: + - orcid: affiliation: - name: Xinpeng Zhang - - orcid: + - orcid: affiliation: - name: Alexandru Mizeranschi - - orcid: + - orcid: affiliation: - name: Dediu Codrin - - orcid: + - orcid: affiliation: - name: Moritz E. Beber orcid: 0000-0003-2406-1978 @@ -62,42 +64,40 @@ authors: orcid: 0000-0002-4528-5877 affiliation: "2, 3, 9, 10" affiliations: - - name: Department of Paleobiotechnology, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Germany - index: 1 - - name: Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Germany - index: 2 - - name: Associated Research Group of Archaeogenetics, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Germany - index: 3 - - name: Quantitative Biology Center (QBiC), University of Tübingen, Germany - index: 4 - - name: Institute for Globally Distributed Open Research and Education (IGDORE), Sweden - index: 5 - - name: nf-core community members are available at acknowledgments. 
- index: 6 - - name: M3 Research Center, Faculty of Medicine, University of Tübingen, Germany - index: 7 - - name: Department of Computer Science, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Tübingen, Germany - index: 8 - - name: Faculty of Biological Sciences, Friedrich-Schiller University Jena, Germany - index: 9 - - name: Department of Anthropology, Harvard University, USA - index: 10 - - name: Institute of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Germany - index: 11 + - name: Department of Paleobiotechnology, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Germany + index: 1 + - name: Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Germany + index: 2 + - name: Associated Research Group of Archaeogenetics, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Germany + index: 3 + - name: Quantitative Biology Center (QBiC), University of Tübingen, Germany + index: 4 + - name: Institute for Globally Distributed Open Research and Education (IGDORE), Sweden + index: 5 + - name: nf-core community members are available at acknowledgments. 
+ index: 6 + - name: M3 Research Center, Faculty of Medicine, University of Tübingen, Germany + index: 7 + - name: Department of Computer Science, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Tübingen, Germany + index: 8 + - name: Faculty of Biological Sciences, Friedrich-Schiller University Jena, Germany + index: 9 + - name: Department of Anthropology, Harvard University, USA + index: 10 + - name: Institute of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Germany + index: 11 date: 14 April 2026 bibliography: paper.bib - --- # Summary -Genome-mining of bacterial DNA fosters the discovery of antimicrobial resistance-related genes as well as genes required for the biosynthesis of low molecular weight natural products or specialized metabolites. -However, current approaches to identify these functional genes remain inefficient due to heterogeneous computational platforms, accessibility, scalability, and inconsistent reporting of results of bioinformatic analysis tools. -Here, we present nf-core/funcscan, an open source bioinformatics best-practice Nextflow pipeline for the screening of functional features from assembled contigs or genomes. -nf-core/funcscan currently integrates 13 tools to simultaneously predict antimicrobial peptides, antibiotic resistance genes, biosynthetic gene clusters, and taxonomic classification from partial or full genomes. -It is straightforward to install, portable across platforms ranging from personal laptops to high-performance computing clusters, and fully reproducible via the use of software containers. -Both command-line and graphical interfaces are supported. -nf-core/funcscan also introduces standardized output formats, enabling the rapid evaluation, visualization, and interpretation of results. 
+Genome-mining of bacterial DNA fosters the discovery of antimicrobial resistance-related genes as well as genes required for the biosynthesis of low molecular weight natural products or specialised metabolites. +Despite the availability of many bioinformatic tools to identify such functional genes, screening of genomic features remains inefficient due to heterogeneous computational platforms, accessibility, scalability, and inconsistent reporting and formatting of the results. +Here, we present nf-core/funcscan, an open source bioinformatics pipeline for the screening of microbial functional features from assembled contigs or genomes. +The pipeline currently integrates 13 tools to simultaneously predict antimicrobial peptides, antibiotic resistance genes, biosynthetic gene clusters, and taxonomic classification from partial or full genomes. +It also introduces standardised and aggregated output file reports across all tools, enabling the rapid evaluation, visualisation, and interpretation of results. +Written in the Nextflow workflow language, it is straightforward to install, portable across platforms ranging from personal laptops to high-performance computing clusters, and fully reproducible via the use of software containers. # Statement of need @@ -111,20 +111,20 @@ However, investigating antibiotic agents in combination with antibiotic resistan Due to this pressing problem, a large suite of different tools has been developed for the rapid identification of different functional gene types. These tools use different search algorithms and databases (e.g. deepBGC: machine-learning, antiSMASH: rule-based) for the prediction of microbial metabolites, which differ in quality and quantity of the predicted properties. -Thus, to maximize the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity across all metabolite categories. 
+Thus, to maximise the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity across all metabolite categories. Since these tools are often developed as standalone tools they have to be executed separately. This renders analyses inefficient and thus impedes scalability and poses the risk of lowering reproducibility. While some tools are available as software containers (e.g. via docker, singularity), thus helping reproducibility of results, they require a series of steps to prepare input data and manually store and filter results. Additionally, standalone tools have their own unique output format, which These points strongly hamper efficiency and in many cases reproducibility of complex analyses. -Overall, in order to obtain results from various tools in a uniform format, manual inspection is still necessary. +Overall, in order to obtain results from various tools in a uniform format, manual inspection is still necessary. This renders the comparison of results from large datasets against multiple tools very impractical if not impossible. -Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identifcation for functional prediction resulted in the genrationt of pipelines mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. -However, so far, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonization manner. -Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes thir use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. 
+Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identifcation for functional prediction resulted in the genrationt of pipelines mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. +However, so far, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner. +Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes thir use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled contiguous sequences (contigs), specifically predicting ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences in a portable, reproducible, and scalable manner. This allows researchers to obtain a holistic view on the genomic context of identified genes for downstream analyses in the context of antimicrobial resistance -# State of the field +# State of the field The continuing decrease in sequencing costs and the subsequent increase in available sequenced prokaryotic genomes and metagenomes has gone hand-in-hand with the development of numerous bioinformatics tools to predict gene functions. Several pipelines have been developed to chain single-purpose tools together to provide a more comprehensive context. 
@@ -157,7 +157,7 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel # Software design nf-core/funcscan simultaneously predicts antimicrobial peptide (AMP) genes, antibiotic resistance genes (ARGs), biosynthetic gene clusters (BGCs) as well as carbohydrate active enzyme gene clusters (CGC) from partial or full (meta)genomic sequences. -In addition, the bacterial taxonomy of input sequences is determined and standardized summaries of all tool outputs are provided (Fig. \ref{fig:workflow}). +In addition, the bacterial taxonomy of input sequences is determined and standardised summaries of all tool outputs are provided (Fig. \ref{fig:workflow}). ![Workflow overview of nf-core/funcscan. (1), genomic sequences are prepared and annotated with one of four ORF annotation tools. @@ -176,7 +176,7 @@ Open reading frames are predicted from the pre-processed sequences by one of fou If annotated sequence files as described above are provided in the samplesheet, this step is skipped. Various tools of nf-core/funcscan rely on databases and reference files to operate. -The pipeline offers the functionality to download these databases automatically for the user, which can then be stored and reused in future pipeline runs to minimize pipeline runtime, network traffic, and possible download limits. +The pipeline offers the functionality to download these databases automatically for the user, which can then be stored and reused in future pipeline runs to minimise pipeline runtime, network traffic, and possible download limits. The database download is applicable for MMSeqs2, Bakta, AMPcombi, AMRFinderPlus, DeepARG, RGI, antiSMASH, DeepBGC, and InterProScan. 
## Gene prediction and taxonomic classification @@ -188,17 +188,17 @@ In a second step, users can choose to scan genomic sequences in parallel with th - AMP subworkflow: ampir, AMPlify, hmmsearch, Macrel In an additional optional parallel screening step, all input sequences can be taxonomically classified by MMSeqs2 to determine likely source hosts of each functional hit. -Characterizing the taxonomic origin of metagenomic contigs can inform users about potentially suitable hosts for downstream experiments, e.g. heterologous expression systems. +Characterising the taxonomic origin of metagenomic contigs can inform users about potentially suitable hosts for downstream experiments, e.g. heterologous expression systems. The taxonomic classification supports a variety of reference databases (e.g. GTDB, UniProt, UniRef, NR, Kalamari) to suit different user requirements. Optionally, protein domains and families can be further annotated by InterProScan(37, 38). ## Aggregation of screening results All screening tools of nf-core/funcscan have heterogeneous output formats and label their respective gene predictions differently. -This hampers aggregation and cross-comparisons of results, requiring manual inspection and ‘clean up’ of results for downstream interpretation. +This hampers aggregation and cross-comparisons of results, requiring manual inspection and ‘clean up’ of results for downstream interpretation. To enable users to easily extract information for downstream analyses, nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type, thereby allowing direct comparison of gene classification with possible taxonomic sources. For the summary of ARGs we have used the existing hAMRonization software. Since similar tools do not exist for AMPs and BGCs, we developed two novel tools for the aggregation of these gene types. 
-comBGC and AMPcombi parse the results of BGC and AMP prediction tools and summarize them into single tables, respectively. +comBGC and AMPcombi parse the results of BGC and AMP prediction tools and summarise them into single tables, respectively. Furthermore, AMPcombi aligns the AMP hits against a reference AMP database for deeper functional classification. To assist researchers in their choice of genes for testing in wet-lab heterologous expression systems, AMPcombi provides the ability to reduce false positive hits by additional post-screening filtering steps of AMP results. Reasonable default parameters are set by the pipeline and can be adjusted by the user. @@ -206,12 +206,12 @@ Finally, three local pipeline modules merge the gene summaries with taxonomy res ## Reproducibility and scalability -All nf-core pipelines utilize software environments (conda) or containers (Docker, Singularity), which have the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. +All nf-core pipelines utilise software environments (conda) or containers (Docker, Singularity), which have the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. Each tool of nf-core/funcscan is automatically pulled from the respective container registry when executing a pipeline run. The pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and conda). The configuration of the pipeline to the underlying computing system requires knowledge of its software environment and hardware resources. -To facilitate easy configuration, nf-core provides already centralized configurations for more than 150 HPCs via the central nf-core/configs repository (https://github.com/nf-core/configs). 
-The performance of each pipeline run (including software versions of all applied tools, memory and CPU usage) is summarized in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. +To facilitate easy configuration, nf-core provides already centralised configurations for more than 150 HPCs via the central nf-core/configs repository (https://github.com/nf-core/configs). +The performance of each pipeline run (including software versions of all applied tools, memory and CPU usage) is summarsed in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. # Research impact statement @@ -233,6 +233,6 @@ J.F. received a fellowship from the International Leibniz Research School (under This project was funded by grants from the Werner Siemens Foundation (Paleobiotechnology to C.W. and P.S.) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, under Germany’s Excellence Strategy – EXC 2051 – Project-ID 390713860 to C.W. and P.S.). J.A.F.Y was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460129525 (NFDI4Microbiota, FlexFund project EnterArchaeo). -This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). +This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). -# References \ No newline at end of file +# References From 4563b19796fa52a02ff2e20c505c847d12eef802 Mon Sep 17 00:00:00 2001 From: "James A. 
Fellows Yates" Date: Thu, 23 Apr 2026 13:31:08 +0200 Subject: [PATCH 2/7] Fix typos --- paper/paper.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/paper/paper.md b/paper/paper.md index 8bc65633..e1d23f80 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -102,7 +102,7 @@ Written in the Nextflow workflow language, it is straightforward to install, por # Statement of need The emergence and spread of multidrug resistant microbial pathogens poses a serious threat to global health. -Traditionally, most antiinfective drugs have been derived from bacterially produced low molecular weight natural products. +Traditionally, most anti-infective drugs have been derived from bacterially produced low molecular weight natural products. To ensure self-resistance against antimicrobial agents, the producing bacteria typically exhibit resistance mechanisms. As a consequence, the evolution of antimicrobials and the corresponding resistance mechanisms are strongly correlated. Although antibiotic resistance is tightly linked to self-protection of the producing organisms, the recent excessive use of antibiotics and lack of global surveillance both in healthcare and agriculture has led to an explosion of multidrug resistant bacteria. @@ -112,13 +112,13 @@ However, investigating antibiotic agents in combination with antibiotic resistan Due to this pressing problem, a large suite of different tools has been developed for the rapid identification of different functional gene types. These tools use different search algorithms and databases (e.g. deepBGC: machine-learning, antiSMASH: rule-based) for the prediction of microbial metabolites, which differ in quality and quantity of the predicted properties. Thus, to maximise the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity across all metabolite categories. 
-Since these tools are often developed as standalone tools they have to be executed separately. +Since these tools are often developed as stand-alone tools they have to be executed separately. This renders analyses inefficient and thus impedes scalability and poses the risk of lowering reproducibility. While some tools are available as software containers (e.g. via docker, singularity), thus helping reproducibility of results, they require a series of steps to prepare input data and manually store and filter results. Additionally, standalone tools have their own unique output format, which These points strongly hamper efficiency and in many cases reproducibility of complex analyses. Overall, in order to obtain results from various tools in a uniform format, manual inspection is still necessary. This renders the comparison of results from large datasets against multiple tools very impractical if not impossible. -Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identifcation for functional prediction resulted in the genrationt of pipelines mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. +Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction resulted in the generation of pipelines mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. However, so far, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner. 
Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes thir use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled contiguous sequences (contigs), specifically predicting ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences in a portable, reproducible, and scalable manner. @@ -167,13 +167,13 @@ Two additional classification workflows can be used to classify contigs taxonomi ## Input pre-processing and open reading frame annotation -The pipeline processes a two- to four- column table (comma-separated, CSV format) samplesheet as input. +The pipeline processes a two- to four- column table (comma-separated, CSV format) sample-sheet as input. Sample names and paths to the respective nucleotide FASTA files containing (meta)genomic contigs or genomes to be screened are required. -Optionally, preannotated sequence files can be supplied to the pipeline in a four-column samplesheet with open reading frame amino acid sequences in FASTA format, and their respective annotations in GenBank Flat File format. +Optionally, pre-annotated sequence files can be supplied to the pipeline in a four-column sample-sheet with open reading frame amino acid sequences in FASTA format, and their respective annotations in GenBank Flat File format. During preprocessing, any gzipped sequence files are decompressed, and, when running the BGC subworkflow, short contigs are removed by SeqKit (default: contigs shorter than 3,000 bp). The latter step reduces the runtime of the pipeline by removing too-short sequences that would produce no biologically meaningful BGC results. 
-Open reading frames are predicted from the pre-processed sequences by one of four annotation tools (Bakta, Prodigal, Prokka, and Pyrodigal). -If annotated sequence files as described above are provided in the samplesheet, this step is skipped. +Open reading frames are predicted from the preprocessed sequences by one of four annotation tools (Bakta, Prodigal, Prokka, and Pyrodigal). +If annotated sequence files as described above are provided in the sample-sheet, this step is skipped. Various tools of nf-core/funcscan rely on databases and reference files to operate. The pipeline offers the functionality to download these databases automatically for the user, which can then be stored and reused in future pipeline runs to minimise pipeline runtime, network traffic, and possible download limits. @@ -211,7 +211,7 @@ Each tool of nf-core/funcscan is automatically pulled from the respective contai The pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and conda). The configuration of the pipeline to the underlying computing system requires knowledge of its software environment and hardware resources. To facilitate easy configuration, nf-core provides already centralised configurations for more than 150 HPCs via the central nf-core/configs repository (https://github.com/nf-core/configs). -The performance of each pipeline run (including software versions of all applied tools, memory and CPU usage) is summarsed in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. +The performance of each pipeline run (including software versions of all applied tools, memory and CPU usage) is summarised in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. 
# Research impact statement From 8ec8af2c8e3b77f500711c9aa9984aecf38c1254 Mon Sep 17 00:00:00 2001 From: "James A. Fellows Yates" Date: Thu, 23 Apr 2026 13:39:40 +0200 Subject: [PATCH 3/7] Typos, grammar fixes --- paper/paper.md | 35 +++++++++++++++++++---------------- 1 file changed, 19 insertions(+), 16 deletions(-) diff --git a/paper/paper.md b/paper/paper.md index e1d23f80..b7fb6a13 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -114,14 +114,15 @@ These tools use different search algorithms and databases (e.g. deepBGC: machine Thus, to maximise the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity across all metabolite categories. Since these tools are often developed as stand-alone tools they have to be executed separately. This renders analyses inefficient and thus impedes scalability and poses the risk of lowering reproducibility. -While some tools are available as software containers (e.g. via docker, singularity), thus helping reproducibility of results, they require a series of steps to prepare input data and manually store and filter results. -Additionally, standalone tools have their own unique output format, which These points strongly hamper efficiency and in many cases reproducibility of complex analyses. +While some tools are available as software containers (e.g. via Docker, singularity), thus helping reproducibility of results, they require a series of steps to prepare input data and manually store and filter results. +Additionally, stand-alone tools have their own unique output format, which These points strongly hamper efficiency and in many cases reproducibility of complex analyses. Overall, in order to obtain results from various tools in a uniform format, manual inspection is still necessary. -This renders the comparison of results from large datasets against multiple tools very impractical if not impossible. 
-Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction resulted in the generation of pipelines mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. +This renders the comparison of results from large datasets against multiple tools very impractical or impossible. +Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction include pipelines such as mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. However, so far, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner. -Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes thir use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. -Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled contiguous sequences (contigs), specifically predicting ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences in a portable, reproducible, and scalable manner. +Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes their use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. 
+Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled contiguous sequences (contigs). +The pipeline predicts ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences in a portable, reproducible, and scalable manner. This allows researchers to obtain a holistic view on the genomic context of identified genes for downstream analyses in the context of antimicrobial resistance # State of the field @@ -143,7 +144,7 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel | CAZyme screening | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | | Taxonomic assignment of contigs | ✓ | ✗ | ✗ | (✗) | (✗) | ✓ | ✓ | ✗ | | Results summary | ✓ | ✓ | ✓ | (✓) | (✓) | ✓ | ✓ | ✗ | -| Container support (docker, singularity) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | (✗) | +| Container support (Docker, Singularity) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | (✗) | | Modularity | ✓ | ✓ | ✓ | ✗ | ✓ | (✓) | ✗ | ✗ | | One-click installation | ✓ | ✓ | ✓ | ✗ | ✗ | (✗) | ✗ | ✗ | | Local installation possible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | @@ -165,7 +166,7 @@ Two additional classification workflows can be used to classify contigs taxonomi (2), depending on which workflows are selected by the user, the biosynthetic gene cluster (BGC, purple), antimicrobial peptide (AMP, orange), antibiotic resistance gene (ARG, yellow), or carbohydrate-active enzymes (CAZyme) workflows with their customisable parameters are executed. (3), the results of all tools for each gene category are aggregated and saved in a human- and machine-readable tabular format.\label{fig:workflow}](figure1.png) -## Input pre-processing and open reading frame annotation +## Input preprocessing and open reading frame annotation The pipeline processes a two- to four- column table (comma-separated, CSV format) sample-sheet as input. 
Sample names and paths to the respective nucleotide FASTA files containing (meta)genomic contigs or genomes to be screened are required. @@ -196,7 +197,8 @@ Optionally, protein domains and families can be further annotated by InterProSca All screening tools of nf-core/funcscan have heterogeneous output formats and label their respective gene predictions differently. This hampers aggregation and cross-comparisons of results, requiring manual inspection and ‘clean up’ of results for downstream interpretation. -To enable users to easily extract information for downstream analyses, nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type, thereby allowing direct comparison of gene classification with possible taxonomic sources. +To enable users to easily extract information for downstream analyses, nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type. +These aggregated tables allows direct comparison of gene classification with possible taxonomic sources. For the summary of ARGs we have used the existing hAMRonization software. Since similar tools do not exist for AMPs and BGCs, we developed two novel tools for the aggregation of these gene types. comBGC and AMPcombi parse the results of BGC and AMP prediction tools and summarise them into single tables, respectively. Furthermore, AMPcombi aligns the AMP hits against a reference AMP database for deeper functional classification. 
@@ -206,18 +208,19 @@ Finally, three local pipeline modules merge the gene summaries with taxonomy res ## Reproducibility and scalability -All nf-core pipelines utilise software environments (conda) or containers (Docker, Singularity), which have the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. +All nf-core pipelines utilise software environments (Conda) or containers (Docker, Singularity), which have the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. Each tool of nf-core/funcscan is automatically pulled from the respective container registry when executing a pipeline run. -The pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and conda). +The pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and Conda). The configuration of the pipeline to the underlying computing system requires knowledge of its software environment and hardware resources. -To facilitate easy configuration, nf-core provides already centralised configurations for more than 150 HPCs via the central nf-core/configs repository (https://github.com/nf-core/configs). -The performance of each pipeline run (including software versions of all applied tools, memory and CPU usage) is summarised in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. +To facilitate easy configuration, nf-core provides already centralised configurations for more than 150 HPCs via the central nf-core/configs repository ([https://github.com/nf-core/configs](https://github.com/nf-core/configs)). 
+The performance of each pipeline run (including software versions of all applied tools, memory, and CPU usage) is summarised in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. # Research impact statement nf-core/funcscan has developed an active user community of scientific users and developers who continuously contribute ideas, bug reports and code via issues and pull requests on GitHub. -An exemplary case in point is the contribution of a whole new workflow (CAZyme screening) by new community members. -Discussions of pipeline as well as research domain related topics happen on the open-to-join nf-core workspace on the Slack platform, illustrating the public interest and pro-active efforts from scientific users to use, maintain, and improve the pipeline functionalities. +The pipeline is already being actively used in research (https://www.mdpi.com/2076-2607/14/1/145, https://link.springer.com/article/10.1007/s12602-025-10718-9, https://pmc.ncbi.nlm.nih.gov/articles/PMC12051446/, https://link.springer.com/article/10.1007/s12223-026-01445-x) +Additionally, the pipeline received a contribution of a whole new workflow (CAZyme screening) by new community members outside of the original developers. +Discussions of pipeline as well as research domain related topics happen on the open-to-join nf-core workspace on the Slack platform. This illustrates the public interest and proactive efforts from scientific users to use, maintain, and improve the pipeline.functionalities. # AI usage disclosure @@ -227,7 +230,7 @@ of this manuscript, or the preparation of supporting materials. # Acknowledgements We thank Vedanth Ramji for adding argNorm to the ARG subworkflow. -A full list of nf-core community members is available at https://nf-co.re/contributors/. +A full list of nf-core community members is available at [https://nf-co.re/contributors/](https://nf-co.re/contributors/). 
We thank Martin Klapper and Rosa Herbst for helpful feedback on relevant BGC and AMP properties during comBGC and AMPcombi development. J.F. received a fellowship from the International Leibniz Research School (under the head of the Jena School for Microbial Communication, JSMC). From a55cb5ad8b178d465d471ca6f8112df428ae23cd Mon Sep 17 00:00:00 2001 From: "James A. Fellows Yates" Date: Thu, 23 Apr 2026 14:06:42 +0200 Subject: [PATCH 4/7] Further comments/re-writes suggestions and cutting down --- paper/paper.md | 86 ++++++++++++++++++++++++-------------------------- 1 file changed, 41 insertions(+), 45 deletions(-) diff --git a/paper/paper.md b/paper/paper.md index b7fb6a13..fd68fb1f 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -107,33 +107,34 @@ To ensure self-resistance against antimicrobial agents, the producing bacteria t As a consequence, the evolution of antimicrobials and the corresponding resistance mechanisms are strongly correlated. Although antibiotic resistance is tightly linked to self-protection of the producing organisms, the recent excessive use of antibiotics and lack of global surveillance both in healthcare and agriculture has led to an explosion of multidrug resistant bacteria. Over the past few decades, the spread of antibiotic resistance genes (ARGs) and pathogenic bacteria carrying them has grown to a major threat to human health. -However, investigating antibiotic agents in combination with antibiotic resistance mechanisms and ARG evolution has the potential to aid in the development of new antibiotics. - -Due to this pressing problem, a large suite of different tools has been developed for the rapid identification of different functional gene types. -These tools use different search algorithms and databases (e.g. deepBGC: machine-learning, antiSMASH: rule-based) for the prediction of microbial metabolites, which differ in quality and quantity of the predicted properties. 
-Thus, to maximise the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity across all metabolite categories. -Since these tools are often developed as stand-alone tools they have to be executed separately. -This renders analyses inefficient and thus impedes scalability and poses the risk of lowering reproducibility. -While some tools are available as software containers (e.g. via Docker, singularity), thus helping reproducibility of results, they require a series of steps to prepare input data and manually store and filter results. -Additionally, stand-alone tools have their own unique output format, which These points strongly hamper efficiency and in many cases reproducibility of complex analyses. -Overall, in order to obtain results from various tools in a uniform format, manual inspection is still necessary. -This renders the comparison of results from large datasets against multiple tools very impractical or impossible. -Previous efforts to scale up the predictive power of different tools, including assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction include pipelines such as mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. -However, so far, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner. -Finally, extensive command-line knowledge, the use of shell scripts, and manual installation of software dependencies to run many of these tools, effectively precludes their use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. 
-Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled contiguous sequences (contigs). -The pipeline predicts ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences in a portable, reproducible, and scalable manner. +Identifying new antibiotic agents from novel sources in combination with antibiotic resistance mechanisms and ARG evolution has the potential to aid in the development of new antibiotics. + +Due to this pressing problem, a large suite of different tools has been developed for the rapid identification of different functional gene types from sequencing data. +These tools use different search algorithms and databases (e.g. deepBGC: machine-learning, antiSMASH: rule-based) for the prediction of different types of microbial metabolites. +To maximise the potential of detecting important functional genes, researchers often need to use multiple approaches to ensure maximum detection sensitivity during screening. +Since these tools are often developed as stand-alone tools with specific databases they have to be executed separately. +This impedes scalability due to inefficiency and additionally poses an increased risk of lowering reproducibility when executed manually. +While some tools are available as software containers (e.g. via Docker, Singularity), thus helping reproducibility of results, they often require a series of steps to prepare input data and manually store and filter results. +Additionally, stand-alone tools have their own unique output format, making cross-comparison of the results between different tools nontrivial, and often results in manual processing and inspection - again further restricting scalability. 
+ +Previous efforts to scale up the predictive power of different tools, including (meta)genomic assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction include pipelines such as mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH. +However, to our knowledge, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner. +Additionally, extensive command-line knowledge and manual installation of software dependencies are required to run many of these existing pipelines. +This effectively precludes their use by biochemists, biomolecular scientists, and biologists who typically have limited computational training. + +Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled microbial contiguous sequences (contigs). +The pipeline predicts ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences parallel in a portable, reproducible, and scalable manner. This allows researchers to obtain a holistic view on the genomic context of identified genes for downstream analyses in the context of antimicrobial resistance # State of the field The continuing decrease in sequencing costs and the subsequent increase in available sequenced prokaryotic genomes and metagenomes has gone hand-in-hand with the development of numerous bioinformatics tools to predict gene functions. Several pipelines have been developed to chain single-purpose tools together to provide a more comprehensive context. -nf-core/funcscan is to the best of our knowledge the first pipeline to combine screening for AMPs, ARGs, and BGCs. 
-However, pipelines with similar functionality have been developed, with the closest one being mettannotator (Table 1). -This pipeline meets all criteria of scalability and reproducibility on the same level as nf-core/funcscan because it is likewise written in Nextflow and in most parts based on the nf-core pipeline template. -While focussing on somewhat different gene types (e.g. snRNA, mobilome), shared features include ARG and BGC prediction as well as aggregation of results. -In contrast, nf-core/funcscan provides additional AMP screening, CAZyme screening, and the integration of taxonomic classifications for all genes. + +Pipelines with similar functionality to nf-core/funcscan have been developed, with the most similar being mettannotator (Table 1). +This pipeline meets the criteria of scalability and reproducibility on the same level as nf-core/funcscan, due to its similar implementation in Nextflow and in most parts also based on the nf-core pipeline template. +While focused on somewhat different gene types (e.g. snRNA, mobilome), shared features include ARG and BGC prediction as well as aggregation of results. +In contrast, nf-core/funcscan provides additional AMP screening, CAZyme screening, and the integration of taxonomic classifications for all genes to provide additional ecological context around predicted genes. Regarding pipeline stability and reliability, nf-core/funcscan is the only pipeline to implement comprehensive unit tests on module and pipeline level, using the nf-test framework (Table \ref{tab:pipelines}). | Feature | funcscan | mettannotator | bacannot | HT-ARGfinder | PathoFact | SqueezeMeta | MetaERG | ARGs-OAP | @@ -155,7 +156,7 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel : Comparison of nf-core/funcscan with other related pipelines for ARG, AMP, and BGC discovery. Parentheses indicate either unspecific gene screening or partly fulfilled criteria. 
\label{tab:pipelines} -# Software design +# Workflow overview nf-core/funcscan simultaneously predicts antimicrobial peptide (AMP) genes, antibiotic resistance genes (ARGs), biosynthetic gene clusters (BGCs) as well as carbohydrate active enzyme gene clusters (CGC) from partial or full (meta)genomic sequences. In addition, the bacterial taxonomy of input sequences is determined and standardised summaries of all tool outputs are provided (Fig. \ref{fig:workflow}). @@ -170,10 +171,9 @@ Two additional classification workflows can be used to classify contigs taxonomi The pipeline processes a two- to four- column table (comma-separated, CSV format) sample-sheet as input. Sample names and paths to the respective nucleotide FASTA files containing (meta)genomic contigs or genomes to be screened are required. -Optionally, pre-annotated sequence files can be supplied to the pipeline in a four-column sample-sheet with open reading frame amino acid sequences in FASTA format, and their respective annotations in GenBank Flat File format. -During preprocessing, any gzipped sequence files are decompressed, and, when running the BGC subworkflow, short contigs are removed by SeqKit (default: contigs shorter than 3,000 bp). -The latter step reduces the runtime of the pipeline by removing too-short sequences that would produce no biologically meaningful BGC results. -Open reading frames are predicted from the preprocessed sequences by one of four annotation tools (Bakta, Prodigal, Prokka, and Pyrodigal). +Optionally, pre-annotated sequence files can be supplied to the pipeline in the four-column sample-sheet variant with ORF amino acid sequences in FASTA format, and their respective annotations in GenBank Flat File format. 
+During preprocessing, any gzipped sequence files are decompressed, and, when running the BGC subworkflow, short contigs are removed by SeqKit (default: contigs shorter than 3,000 bp) to reduce runtime by removing too-short sequences that produce no biologically meaningful results. +Open reading frames are predicted from the preprocessed sequences by one of four prokaryotic annotation tools (Bakta, Prodigal, Prokka, and Pyrodigal). If annotated sequence files as described above are provided in the sample-sheet, this step is skipped. Various tools of nf-core/funcscan rely on databases and reference files to operate. @@ -189,43 +189,39 @@ In a second step, users can choose to scan genomic sequences in parallel with th - AMP subworkflow: ampir, AMPlify, hmmsearch, Macrel In an additional optional parallel screening step, all input sequences can be taxonomically classified by MMSeqs2 to determine likely source hosts of each functional hit. -Characterising the taxonomic origin of metagenomic contigs can inform users about potentially suitable hosts for downstream experiments, e.g. heterologous expression systems. +Characterising the taxonomic origin of metagenomic contigs can provide users information about potentially suitable hosts for downstream experiments, e.g. heterologous expression systems. The taxonomic classification supports a variety of reference databases (e.g. GTDB, UniProt, UniRef, NR, Kalamari) to suit different user requirements. -Optionally, protein domains and families can be further annotated by InterProScan(37, 38). +Optionally, protein domains and families can be further annotated by InterProScan. + +Reasonable default parameters for commonly tuned parameters of the screening tools are set by the pipeline, and can be adjusted by the user by dedicated command-line arguments or via a Nextflow parameter file. 
## Aggregation of screening results All screening tools of nf-core/funcscan have heterogeneous output formats and label their respective gene predictions differently. -This hampers aggregation and cross-comparisons of results, requiring manual inspection and ‘clean up’ of results for downstream interpretation. -To enable users to easily extract information for downstream analyses, nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type. -These aggregated tables allows direct comparison of gene classification with possible taxonomic sources. -For the summary of ARGs we have used the existing hAMRonization software. Since similar tools do not exist for AMPs and BGCs, we developed two novel tools for the aggregation of these gene types. -comBGC and AMPcombi parse the results of BGC and AMP prediction tools and summarise them into single tables, respectively. -Furthermore, AMPcombi aligns the AMP hits against a reference AMP database for deeper functional classification. -To assist researchers in their choice of genes for testing in wet-lab heterologous expression systems, AMPcombi provides the ability to reduce false positive hits by additional post-screening filtering steps of AMP results. -Reasonable default parameters are set by the pipeline and can be adjusted by the user. -Finally, three local pipeline modules merge the gene summaries with taxonomy results. +nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type using dedicated tools. +For the summary of ARGs we have used the existing hAMRonization software. +For AMPs AMPcombi parse the results of AMP prediction tools and summarise them into single tables, and aligns the AMP hits against a reference AMP database for deeper functional classification. 
+We wrote a custom script 'comBGC' for aggregating and standardising the output of the BGC tools. ## Reproducibility and scalability -All nf-core pipelines utilise software environments (Conda) or containers (Docker, Singularity), which have the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. -Each tool of nf-core/funcscan is automatically pulled from the respective container registry when executing a pipeline run. -The pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and Conda). +All nf-core pipelines utilise software environments (Conda) or containers (Docker, Singularity) for each integrated tool. +This provides the advantage of isolating the dependencies of all workflows from each other and rendering pipeline execution highly reproducible, portable, and platform-independent. +Thus, the pipeline itself is easy to install as it has only few minimum dependencies (Nextflow itself, and one of Docker, Singularity, Podman, Shifter, Charliecloud, and Conda). The configuration of the pipeline to the underlying computing system requires knowledge of its software environment and hardware resources. -To facilitate easy configuration, nf-core provides already centralised configurations for more than 150 HPCs via the central nf-core/configs repository ([https://github.com/nf-core/configs](https://github.com/nf-core/configs)). +To facilitate configuration and further portability, nf-core provides already centralised configurations for more than 150 institutional computational infrastructures (e.g. HPCs) via the central nf-core/configs repository ([https://github.com/nf-core/configs](https://github.com/nf-core/configs)). 
The performance of each pipeline run (including software versions of all applied tools, memory, and CPU usage) is summarised in HTML reports for all steps of all subworkflows for users to estimate future runtime and/or computational resources. # Research impact statement nf-core/funcscan has developed an active user community of scientific users and developers who continuously contribute ideas, bug reports and code via issues and pull requests on GitHub. -The pipeline is already being actively used in research (https://www.mdpi.com/2076-2607/14/1/145, https://link.springer.com/article/10.1007/s12602-025-10718-9, https://pmc.ncbi.nlm.nih.gov/articles/PMC12051446/, https://link.springer.com/article/10.1007/s12223-026-01445-x) +The pipeline is already being actively used in research (https://www.mdpi.com/2076-2607/14/1/145, https://link.springer.com/article/10.1007/s12602-025-10718-9, https://pmc.ncbi.nlm.nih.gov/articles/PMC12051446/, https://link.springer.com/article/10.1007/s12223-026-01445-x). Additionally, the pipeline received a contribution of a whole new workflow (CAZyme screening) by new community members outside of the original developers. Discussions of pipeline as well as research domain related topics happen on the open-to-join nf-core workspace on the Slack platform. This illustrates the public interest and proactive efforts from scientific users to use, maintain, and improve the pipeline.functionalities. # AI usage disclosure -No generative AI tools were used in the development of this software, the writing -of this manuscript, or the preparation of supporting materials. +No generative AI tools were used in the development of this software, the writing of this manuscript, or the preparation of supporting materials. # Acknowledgements @@ -235,7 +231,7 @@ We thank Martin Klapper and Rosa Herbst for helpful feedback on relevant BGC and J.F. 
received a fellowship from the International Leibniz Research School (under the head of the Jena School for Microbial Communication, JSMC).
This project was funded by grants from the Werner Siemens Foundation (Paleobiotechnology to C.W. and P.S.) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, under Germany’s Excellence Strategy – EXC 2051 – Project-ID 390713860 to C.W. and P.S.).
-J.A.F.Y was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460129525 (NFDI4Microbiota, FlexFund project EnterArchaeo).
+J.A.F.Y and C.W. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460129525 (NFDI4Microbiota, FlexFund project EnterArchaeo).
This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).
# References

From 88a27aa4899469da4e49e08e132975c969f5328c Mon Sep 17 00:00:00 2001
From: Jasmin Frangenberg <73216762+jasmezz@users.noreply.github.com>
Date: Wed, 29 Apr 2026 14:30:58 +0000
Subject: [PATCH 5/7] Apply suggestions from review
Co-authored-by: Jasmin Frangenberg <73216762+jasmezz@users.noreply.github.com>
---
paper/paper.md | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/paper/paper.md b/paper/paper.md
index fd68fb1f..bdad7b6c 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -117,13 +117,13 @@ This impedes scalability due to inefficiency and additionally poses an increased
While some tools are available as software containers (e.g. via Docker, Singularity), thus helping reproducibility of results, they often require a series of steps to prepare input data and manually store and filter results.
Additionally, stand-alone tools have their own unique output format, making cross-comparison of the results between different tools nontrivial, and often results in manual processing and inspection - again further restricting scalability.
-Previous efforts to scale up the predictive power of different tools, including (meta)genomic assembly, open reading frame (ORF) annotation, or gene-identification for functional prediction include pipelines such as mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH.
+Previous efforts to scale up the predictive power of different tools for functional gene prediction include pipelines such as mettannotator, bacannot, SqueezeMeta, MetaErg, METABOLIC, HT-ARGfinder, ARGs-OAP, PathoFact, and antiSMASH.
However, to our knowledge, no pipeline has been created that allows for the identification and prediction of antimicrobial peptide (AMP) genes, ARGs, and biosynthetic gene clusters (BGCs) simultaneously from multiple samples in a harmonised manner.
Additionally, extensive command-line knowledge and manual installation of software dependencies are required to run many of these existing pipelines.
This effectively precludes their use by biochemists, biomolecular scientists, and biologists who typically have limited computational training.
Here, we present nf-core/funcscan, a Nextflow pipeline following nf-core best practices for the simultaneous screening of multiple functional and biosynthetic components from assembled microbial contiguous sequences (contigs).
-The pipeline predicts ARGs, BGCs, AMP-encoding genes, and providing taxonomic information of the producing organisms from (meta)genomic sequences parallel in a portable, reproducible, and scalable manner.
+The pipeline predicts ARGs, BGCs, AMP-encoding genes, and provides taxonomic information of the producing organisms from (meta)genomic sequences parallel in a portable, reproducible, and scalable manner.
This allows researchers to obtain a holistic view on the genomic context of identified genes for downstream analyses in the context of antimicrobial resistance
# State of the field
@@ -151,7 +151,19 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel
| Local installation possible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Web-based execution possible | (✓) | (✓) | (✓) | ✗ | ✗ | ✗ | ✗ | ✗ |
| Software reviewing | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
-| Automated unit tests | ✓ | (✗) | (✗) | (✗) | (✗) | ✗ | ✗ | ✗ |
+| ARG screening | + | + | + | + | + | (+) | (+) | + |
+| AMP screening | + | − | − | − | − | (+) | (+) | − |
+| BGC screening | + | + | − | − | − | (−) | (−) | − |
+| CAZyme screening | + | + | − | − | − | − | − | − |
+| Taxonomic assignment of contigs | + | − | − | (−) | (−) | + | + | − |
+| Results summary | + | + | + | (+) | (+) | + | + | − |
+| Container support (Docker, Singularity) | + | + | + | − | − | − | + | (−) |
+| Modularity | + | + | + | − | + | (+) | − | − |
+| One-click installation | + | + | + | − | − | (−) | − | − |
+| Local installation possible | + | + | + | + | + | + | + | − |
+| Web-based execution possible | (+) | (+) | (+) | − | − | − | − | − |
+| Software reviewing | + | + | − | − | − | − | − | − |
+| Automated unit tests | + | + | (−) | (−) | (−) | − | − | − |
| License | MIT | Apache-2.0 | GPL-3.0 | None | GPL-3.0 | GPL-3.0 | AFL | AFL |
: Comparison of nf-core/funcscan with other related pipelines for ARG, AMP, and BGC discovery. Parentheses indicate either unspecific gene screening or partly fulfilled criteria. \label{tab:pipelines}
@@ -200,8 +212,9 @@ Reasonable default parameters for commonly tuned parameters of the screening too
All screening tools of nf-core/funcscan have heterogeneous output formats and label their respective gene predictions differently.
nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type using dedicated tools.
For the summary of ARGs we have used the existing hAMRonization software.
-For AMPs AMPcombi parse the results of AMP prediction tools and summarise them into single tables, and aligns the AMP hits against a reference AMP database for deeper functional classification.
-We wrote a custom script 'comBGC' for aggregating and standardising the output of the BGC tools.
+For AMPs, AMPcombi parses and filters the results of AMP prediction tools, summarises them into single tables, and aligns the AMP hits against a reference AMP database for deeper functional classification.
+We wrote a custom script 'comBGC' for aggregating and standardising the output of the BGC tools.
+These summaries are finally complemented with results from the optional taxonomic classification workflow.
## Reproducibility and scalability
@@ -217,7 +230,7 @@ The performance of each pipeline run (including software versions of all applied
# Research impact statement
nf-core/funcscan has developed an active user community of scientific users and developers who continuously contribute ideas, bug reports and code via issues and pull requests on GitHub.
The pipeline is already being actively used in research (https://www.mdpi.com/2076-2607/14/1/145, https://link.springer.com/article/10.1007/s12602-025-10718-9, https://pmc.ncbi.nlm.nih.gov/articles/PMC12051446/, https://link.springer.com/article/10.1007/s12223-026-01445-x).
Additionally, the pipeline received a contribution of a whole new workflow (CAZyme screening) by new community members outside of the original developers.
-Discussions of pipeline as well as research domain related topics happen on the open-to-join nf-core workspace on the Slack platform.
This illustrates the public interest and proactive efforts from scientific users to use, maintain, and improve the pipeline.functionalities.
+Discussions of pipeline as well as research domain related topics happen on the open-to-join nf-core workspace on the Slack platform. This illustrates the public interest and proactive efforts from scientific users to use, maintain, and improve the pipeline functionalities.
# AI usage disclosure
@@ -231,7 +244,7 @@ We thank Martin Klapper and Rosa Herbst for helpful feedback on relevant BGC and
J.F. received a fellowship from the International Leibniz Research School (under the head of the Jena School for Microbial Communication, JSMC).
This project was funded by grants from the Werner Siemens Foundation (Paleobiotechnology to C.W. and P.S.) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, under Germany’s Excellence Strategy – EXC 2051 – Project-ID 390713860 to C.W. and P.S.).
-J.A.F.Y and C.W. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460129525 (NFDI4Microbiota, FlexFund project EnterArchaeo).
+J.A.F.Y and C.W. were funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 460129525 (NFDI4Microbiota, FlexFund project EnterArchaeo).
This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).
# References

From 862dc59fe42687624fea1179bedff1947d738b73 Mon Sep 17 00:00:00 2001
From: nf-core-bot
Date: Wed, 29 Apr 2026 14:33:27 +0000
Subject: [PATCH 6/7] [automated] Fix code linting
---
paper/paper.md | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/paper/paper.md b/paper/paper.md
index bdad7b6c..fb6538dc 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -151,19 +151,19 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel
| Local installation possible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Web-based execution possible | (✓) | (✓) | (✓) | ✗ | ✗ | ✗ | ✗ | ✗ |
| Software reviewing | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
-| ARG screening | + | + | + | + | + | (+) | (+) | + |
-| AMP screening | + | − | − | − | − | (+) | (+) | − |
-| BGC screening | + | + | − | − | − | (−) | (−) | − |
-| CAZyme screening | + | + | − | − | − | − | − | − |
-| Taxonomic assignment of contigs | + | − | − | (−) | (−) | + | + | − |
-| Results summary | + | + | + | (+) | (+) | + | + | − |
-| Container support (Docker, Singularity) | + | + | + | − | − | − | + | (−) |
-| Modularity | + | + | + | − | + | (+) | − | − |
-| One-click installation | + | + | + | − | − | (−) | − | − |
-| Local installation possible | + | + | + | + | + | + | + | − |
-| Web-based execution possible | (+) | (+) | (+) | − | − | − | − | − |
-| Software reviewing | + | + | − | − | − | − | − | − |
-| Automated unit tests | + | + | (−) | (−) | (−) | − | − | − |
+| ARG screening | + | + | + | + | + | (+) | (+) | + |
+| AMP screening | + | − | − | − | − | (+) | (+) | − |
+| BGC screening | + | + | − | − | − | (−) | (−) | − |
+| CAZyme screening | + | + | − | − | − | − | − | − |
+| Taxonomic assignment of contigs | + | − | − | (−) | (−) | + | + | − |
+| Results summary | + | + | + | (+) | (+) | + | + | − |
+| Container support (Docker, Singularity) | + | + | + | − | − | − | + | (−) |
+| Modularity | + | + | + | − | + | (+) | − | − |
+| One-click installation | + | + | + | − | − | (−) | − | − |
+| Local installation possible | + | + | + | + | + | + | + | − |
+| Web-based execution possible | (+) | (+) | (+) | − | − | − | − | − |
+| Software reviewing | + | + | − | − | − | − | − | − |
+| Automated unit tests | + | + | (−) | (−) | (−) | − | − | − |
| License | MIT | Apache-2.0 | GPL-3.0 | None | GPL-3.0 | GPL-3.0 | AFL | AFL |
: Comparison of nf-core/funcscan with other related pipelines for ARG, AMP, and BGC discovery. Parentheses indicate either unspecific gene screening or partly fulfilled criteria. \label{tab:pipelines}
@@ -212,9 +212,9 @@ Reasonable default parameters for commonly tuned parameters of the screening too
All screening tools of nf-core/funcscan have heterogeneous output formats and label their respective gene predictions differently.
nf-core/funcscan aggregates the output of all gene and taxonomic screening tools in each executed subworkflow into single human- and machine-readable tables in CSV format per gene type using dedicated tools.
For the summary of ARGs we have used the existing hAMRonization software.
-For AMPs, AMPcombi parses and filters the results of AMP prediction tools, summarises them into single tables, and aligns the AMP hits against a reference AMP database for deeper functional classification.
-We wrote a custom script 'comBGC' for aggregating and standardising the output of the BGC tools.
-These summaries are finally complemented with results from the optional taxonomic classification workflow.
+For AMPs, AMPcombi parses and filters the results of AMP prediction tools, summarises them into single tables, and aligns the AMP hits against a reference AMP database for deeper functional classification.
+We wrote a custom script 'comBGC' for aggregating and standardising the output of the BGC tools.
+These summaries are finally complemented with results from the optional taxonomic classification workflow.
 ## Reproducibility and scalability

From c46190f1b5df63ecff49e3329fa299385116e028 Mon Sep 17 00:00:00 2001
From: Jasmin Frangenberg <73216762+jasmezz@users.noreply.github.com>
Date: Wed, 29 Apr 2026 16:59:08 +0200
Subject: [PATCH 7/7] Update tools table

---
 paper/paper.md | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/paper/paper.md b/paper/paper.md
index fb6538dc..7bfa4770 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -139,18 +139,6 @@ Regarding pipeline stability and reliability, nf-core/funcscan is the only pipel

 | Feature | funcscan | mettannotator | bacannot | HT-ARGfinder | PathoFact | SqueezeMeta | MetaERG | ARGs-OAP |
 | --------------------------------------- | -------- | ------------- | -------- | ------------ | --------- | ----------- | ------- | -------- |
-| ARG screening | ✓ | ✓ | ✓ | ✓ | ✓ | (✓) | (✓) | ✓ |
-| AMP screening | ✓ | ✗ | ✗ | ✗ | ✗ | (✓) | (✓) | ✗ |
-| BGC screening | ✓ | ✓ | ✗ | ✗ | ✗ | (✗) | (✗) | ✗ |
-| CAZyme screening | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
-| Taxonomic assignment of contigs | ✓ | ✗ | ✗ | (✗) | (✗) | ✓ | ✓ | ✗ |
-| Results summary | ✓ | ✓ | ✓ | (✓) | (✓) | ✓ | ✓ | ✗ |
-| Container support (Docker, Singularity) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | (✗) |
-| Modularity | ✓ | ✓ | ✓ | ✗ | ✓ | (✓) | ✗ | ✗ |
-| One-click installation | ✓ | ✓ | ✓ | ✗ | ✗ | (✗) | ✗ | ✗ |
-| Local installation possible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
-| Web-based execution possible | (✓) | (✓) | (✓) | ✗ | ✗ | ✗ | ✗ | ✗ |
-| Software reviewing | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
 | ARG screening | + | + | + | + | + | (+) | (+) | + |
 | AMP screening | + | − | − | − | − | (+) | (+) | − |
 | BGC screening | + | + | − | − | − | (−) | (−) | − |