DNMB

The domestication of non-model bacteria (DNMB) presents a distinct set of challenges and opportunities. To address these, we introduce the DNMB pipeline, a comprehensive genomics analysis tool tailored to non-model bacterial species. Unlike traditional model organisms, non-model bacteria often lack well-defined genetic resources and established analytical pipelines, so researchers face hurdles in elucidating the genetic basis of phenotypic traits relevant to domestication. The pipeline integrates multiple scripts and computational tools to streamline the stages of genomic analysis, from producing comprehensive functional annotation in a user-friendly table format to characterizing genomic features such as codon usage and ribosomal binding site (RBS) preference and spacing. Here we give an overview of the DNMB pipeline, highlighting its key components and functionalities, and demonstrate its utility through a case study involving the domestication of a non-model bacterial strain. The DNMB pipeline not only accelerates genomic analysis but also enhances our understanding of non-model bacterial physiology, thereby facilitating the exploitation of microbial diversity for biotechnological applications.

Key Features

Diverse Scripts: Our collection includes a range of scripts, each developed to address unique challenges in systems biology research.

Curated Database: Access to a comprehensive database, compiled to support and enhance your research projects. We provide a package that tabulates data from various databases, including REBASE, MEROPS, and CAZy_dbCAN3. The tables can be easily converted into FASTA format, so users can extract the information they need with sequence analysis tools such as BLAST.
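As a rough illustration of the table-to-FASTA conversion mentioned above, the base R sketch below writes one FASTA record per table row. The helper `table_to_fasta()`, the column names `locus_tag` and `sequence`, and the example rows are all hypothetical and are not taken from the DNMB package.

```r
# Hypothetical helper: convert a table with ID and sequence columns to FASTA lines.
table_to_fasta <- function(tbl, id_col = "locus_tag", seq_col = "sequence") {
  # Skip rows that have no sequence.
  keep <- !is.na(tbl[[seq_col]]) & nzchar(tbl[[seq_col]])
  tbl <- tbl[keep, , drop = FALSE]
  if (nrow(tbl) == 0) return(character(0))
  # Interleave header lines (">id") and sequence lines, one record per row.
  as.vector(rbind(paste0(">", tbl[[id_col]]), tbl[[seq_col]]))
}

# Toy table with made-up identifiers and sequences.
tbl <- data.frame(
  locus_tag = c("ABC_00001", "ABC_00002", "ABC_00003"),
  sequence  = c("MKTAYIAK", NA, "MSEQLLNK"),
  stringsAsFactors = FALSE
)

fasta <- table_to_fasta(tbl)
writeLines(fasta, file.path(tempdir(), "example.faa"))
```

The resulting file is plain FASTA, so it could be used as a query or database input for a tool such as BLAST.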

User-Friendly Documentation: Detailed documentation is available to guide you through the installation, setup, and utilization of both the scripts and the database.

Algorithms for analysis

Ribosomal binding site:

The RBS algorithm begins by extracting the last 9 nucleotides of the full-length 16S ribosomal RNA. It then identifies all instances of the reverse complement sequence within the genome, allowing up to 2 mismatches. From these, it selects sequences based on their proximity to an annotated start codon—typically within a range of 1 to 10 nucleotides (this range is the default setting). Finally, it statistically calculates the preference for selected RBS sequences and the distance (spacer) between the start codon and the RBS sequence.
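The steps above can be sketched in base R as follows. This is a simplified illustration and not the DNMB implementation: it scans only the forward strand, it takes the spacer to be the number of nucleotides between the end of the RBS window and the start codon (an assumption about how the 1 to 10 range is measured), and the function names and toy sequences are hypothetical.

```r
# Reverse complement of a DNA string.
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", toupper(x))
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical helper: for each annotated start position (1-based, forward strand),
# test every allowed spacer length for an RBS-like window with few mismatches.
find_rbs <- function(genome, rrna16s, start_pos,
                     tail_len = 9, max_mismatch = 2, spacer = 1:10) {
  genome  <- toupper(genome)
  rrna16s <- chartr("U", "T", toupper(rrna16s))
  # Last tail_len nucleotides of the 16S rRNA, reverse-complemented.
  motif <- revcomp(substring(rrna16s, nchar(rrna16s) - tail_len + 1))
  m <- strsplit(motif, "")[[1]]
  hits <- list()
  for (s in start_pos) {
    for (d in spacer) {
      rbs_end   <- s - d - 1            # last RBS base, d spacer bases before the start codon
      rbs_start <- rbs_end - tail_len + 1
      if (rbs_start < 1) next
      w  <- strsplit(substring(genome, rbs_start, rbs_end), "")[[1]]
      mm <- sum(w != m)
      if (mm <= max_mismatch) {
        hits[[length(hits) + 1]] <- data.frame(
          start = s, spacer = d, rbs = paste(w, collapse = ""),
          mismatches = mm, stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, hits)  # NULL when nothing matches
}

# Toy example: an RBS, a 5-nt spacer, then an ATG start codon at position 19.
genome  <- "CCCCTAAGGAGGTACAATATGAAA"
rrna16s <- "GGATCACCTCCTTA"   # made-up 3' end of a 16S rRNA gene
hits <- find_rbs(genome, rrna16s, start_pos = 19)

# Simple summaries of RBS preference and spacer length.
table(hits$rbs)
table(hits$spacer)
```

Tabulating `rbs` and `spacer` over all annotated genes would give a crude version of the preference and distance statistics described above.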

Codon usage:

Codon usage analysis quantifies the total count of amino acids and nucleotides (codons) across the entire coding sequence. This analysis is performed using the uco function from the seqinr package, calculating three key indices: eff for codon counts, freq for relative frequencies of codons, and rscu for the Relative Synonymous Codon Usage index. The terms "eff", "freq", and "rscu" are equivalent to "R0", "R1", and "R3", respectively, as defined in Suzuki et al. (2005) under the section "2.2 Normalization of codon usage data". Furthermore, "eff" and "rscu" correspond to "AF" and "RSCU", respectively, in Suzuki et al. (2008) "2.2. Definitions of codon usage data".
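To make the three indices concrete, here is a minimal base R sketch that recomputes them for a single toy coding sequence using the standard genetic code. It is an independent illustration and does not call seqinr, so details such as the handling of amino acids absent from the sequence (NaN in this sketch) may differ from the uco output; the function name `codon_usage()` is hypothetical.

```r
# Hypothetical helper: codon counts (eff), relative frequencies (freq), and RSCU.
codon_usage <- function(cds) {
  # Standard genetic code, codons ordered with bases T, C, A, G.
  bases  <- c("T", "C", "A", "G")
  g      <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
  codons <- paste0(g$b1, g$b2, g$b3)
  aa <- strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
  names(aa) <- codons

  # Split the sequence into in-frame codons.
  cds      <- toupper(cds)
  n        <- nchar(cds) %/% 3
  starts   <- seq(1, by = 3, length.out = n)
  observed <- substring(cds, starts, starts + 2)

  # eff: raw counts; codons with ambiguous bases are dropped.
  eff  <- setNames(as.numeric(table(factor(observed, levels = codons))), codons)
  # freq: counts divided by the total number of codons.
  freq <- eff / sum(eff)
  # rscu: count divided by the mean count of its synonymous codons.
  aa_total <- tapply(eff, aa, sum)
  aa_size  <- table(aa)
  rscu <- eff * as.numeric(aa_size[aa]) / as.numeric(aa_total[aa])

  data.frame(codon = codons, aa = unname(aa), eff = eff, freq = freq,
             rscu = rscu, row.names = NULL, stringsAsFactors = FALSE)
}

res <- codon_usage("ATGGCTGCTGCCAAATAA")
res[res$eff > 0, ]
```

In this toy sequence alanine is encoded by GCT twice and GCC once, so with four synonymous alanine codons the RSCU of GCT is 2 / (3 / 4) = 8/3.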

Getting Started:

To begin using our resources, please follow the steps outlined in our documentation. Whether you're looking to integrate our scripts into your existing projects or explore our database for new insights, we've provided all the necessary instructions to get you started.

Installation

Requirements

DNMB is supported on macOS, Linux, and Windows machines that can provide an R environment. It requires R version >=4.2.1 for the release version and R version >=4.3 for the development version.

One third-party tool, InterProScan, is not available on Windows or macOS machines.

The EggNOG-mapper web server lets users submit sequences in FASTA format, keyed by locus_tag identifiers, and receive results in either XLSX or CSV format. The standalone version available on GitHub is also compatible with DNMB.

InterProScan requires a Linux operating system. Without access to Linux, you can proceed with the analysis up to EggNOG-mapper in the annotation stage, but you will not be able to obtain motif analysis information.

To download and install R, see the R-project website.

To download and install InterProScan, see the InterProScan GitHub repository.

To download and install EggNOG-mapper, see the EggNOG-mapper GitHub repository.

Warning

The basic input for genomic analysis is a GenBank file, which must be a full-format file containing both sequence and annotation (gbff, gb, or gbk). In addition, DNMB expects the GenBank format produced by a GeneMarkS-2+ based annotation pipeline; GenBank files obtained from a different annotation pipeline may lead to errors.

Analysis flow

Prerequisites

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("Biostrings", "ComplexHeatmap", "Peptides"))

# grid ships with R, so it does not need to be installed from CRAN.
install.packages(c("qdap", "seqinr", "stringr", "stringi", "splitstackshape", "gtools", "ggplot2", "ggseqlogo", "circlize", "gridExtra", "plyr", "dplyr", "tidyr", "readr", "reshape2", "data.table", "tibble", "openxlsx"))

Note: If you encounter issues installing the qdap package, try installing it with the following command:

install.packages("qdap", INSTALL_opts = "--no-multiarch")

Note: Java and the rJava package must be installed and configured to enable .xlsx output using this package.

Install Java Development Kit (JDK): Download and install the appropriate JDK for your operating system from the Oracle website or OpenJDK.
Install rJava Package in R:

install.packages("rJava")
library(rJava)

Set $JAVA_HOME Path:

You need to set the environment variable JAVA_HOME to point to the location of your JDK installation.

• On Windows:

Install the Java JDK (https://www.oracle.com/kr/java/technologies/downloads/).
Under “System Variables”, check that the following entry exists:
• Variable name: JAVA_HOME
• Variable value: the path to your JDK installation

REM Print the current value of the JAVA_HOME environment variable (e.g., C:\Program Files\Java\jdk-18).
echo %JAVA_HOME%

REM Set the JAVA_HOME environment variable to point to the Java Development Kit (JDK) installation.
REM Replace jdk-[version] with your installed JDK folder (e.g., jdk-18).
setx JAVA_HOME "C:\Program Files\Java\jdk-[version]"

REM Update PATH to include the bin directory of the JDK.
REM Values written by setx are only visible in new Command Prompt windows, so open a new one before running this line.
setx PATH "%JAVA_HOME%\bin;%PATH%"

REM Check the installed Java version to confirm that the correct version is being used.
java -version

Restart R or RStudio.

• On macOS/Linux:

On macOS, if Xcode is not installed, you may encounter compiler issues during package installation. To resolve this, install Xcode from the App Store. Then add the two export lines below to your .bash_profile or .bashrc (depending on the shell):

# List the installed JDKs to find your Java folder (macOS path shown)
ls /Library/Java/JavaVirtualMachines/

# Open your .bash_profile (or .bashrc) file for editing
vi ~/.bash_profile

# In vi, press 'i' to enter insert mode, then add or update the two lines below.
# Set JAVA_HOME to the path of your Java installation
export JAVA_HOME=/Library/Java/JavaVirtualMachines/[my_java_folder]/Contents/Home
# Add Java's bin directory to PATH so that Java commands can be run from the terminal
export PATH=${PATH}:$JAVA_HOME/bin

# Press Esc, then type ':wq!' and press Enter to save ("write") and exit ("quit") vi

# Apply the changes immediately (without restarting the terminal)
source ~/.bash_profile  # or ~/.bashrc

# Verify that JAVA_HOME is set correctly by printing its value
echo $JAVA_HOME

Install DNMB R package

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
    
devtools::install_github("JAEYOONSUNG/DNMB")

Quick start

Run DNMB analysis

setwd([GenBank directory]) # Set the working directory to the location where your GenBank files are stored.
library(DNMB)
run_DNMB()

Run via Docker

The published image ghcr.io/jaeyoonsung/dnmbsuite:latest bundles every external binary the modules call (HMMER, BLAST, DIAMOND, Prodigal, ViennaRNA, eggnog-mapper, padloc, PhiSpy, defense-finder, AcrFinder, PromoTech, …) plus the DNMB R package. Run it from a directory that contains a single GenBank file (*.gbff / *.gbk / *.gb):

docker pull ghcr.io/jaeyoonsung/dnmbsuite:latest

docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:/data" \
  -v "$HOME/.dnmb-cache:/opt/dnmb-cache" \
  ghcr.io/jaeyoonsung/dnmbsuite:latest

-v "$PWD:/data" — the per-genome working directory; outputs land here under dnmb_module_*/ and visualizations/.
-v "$HOME/.dnmb-cache:/opt/dnmb-cache" — persistent module cache. HMM/sequence databases, PromoTech models, EggNOG data, PADLOC HMMs etc. are downloaded into this directory once and reused across every subsequent run on any machine that mounts the same path.
The container's WORKDIR /data and CMD ["Rscript","-e","suppressWarnings(suppressPackageStartupMessages(library(DNMB))); run_DNMB()"] drive a full single-genome pipeline (the suppress wrappers keep the library() attach banners and S4 dispatch warnings out of the run log; [DNMB] … status messages are unaffected). To override flags, pass them through:
```
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:/data" \
  -v "$HOME/.dnmb-cache:/opt/dnmb-cache" \
  ghcr.io/jaeyoonsung/dnmbsuite:latest \
  Rscript -e 'library(DNMB); run_DNMB(module_VirSorter2 = TRUE)'
```

Module defaults inside the image

run_DNMB() runs the following modules unconditionally on first call:

| Default ON | Default OFF |
| --- | --- |
| dbCAN, MEROPS, PAZy, GapMind (AA + Carbon), DefenseFinder (+anti-defense), dbAPIS, AcrFinder, Promotech, mRNAcal, PADLOC, DefensePredictor, REBASEfinder, ISelement, PhiSpy, EggNOG, InterProScan | CLEAN, PIDE (CUDA-gated; see below), Prophage (legacy alias), VirSorter2 |

Override any module with the module_<Name> = TRUE/FALSE argument.

Cache layout

Each module owns a subdirectory of the cache:

~/.dnmb-cache/db_modules/
├── defensefinder/current/      # CasFinder + DefenseFinder repo
├── acrfinder/current/          # AcrFinder repo + python venv
├── promotech/current/          # PromoTech repo + RF-HOT.model
├── padloc/current/             # PADLOC HMMs + bin/padloc wrapper
├── eggnog/data/                # EggNOG DB (~50 GB)
├── gapmind/{aa,carbon}/        # GapMind PaperBLAST data
├── dbcan/, merops/, pazy/, dbapis/, rebasefinder/cache/, …
└── interproscan/, prophage/phispy/, iselement/

current/manifest.rds in each module dir tracks the installed version; DNMB skips re-download when the manifest matches. Sharing the cache across machines is just rsync ~/.dnmb-cache/ host:.dnmb-cache/.

Each [DNMB] status line surfaces the live manifest version it resolved, so you can see at a glance which database/tool release the run is bound to, e.g.

[DNMB] eggnog (data, 2026-04-08) [db=5.0.2, emapper=2.1.12]
[DNMB] padloc (current, 2026-04-07) [db=v2.0.0]
[DNMB] defensefinder (current, 2026-04-25) [models=2.0.2, casfinder=3.1.0, casfinder_models=2.0]
[DNMB] dbcan (current, 2026-04-01) [release=V14, dbcan=8/26/2025, cazydb=7/10/2025]
[DNMB] promotech (current, 2026-04-25) [model=RF-HOT]
[DNMB] mrnacal (embedded — runs from R package, no cache install needed)

Re-run reuse

Every successful module drops a dnmb_module_<m>/.dnmb_<m>_complete sentinel at the end of its run. The next invocation reads <workdir>/.dnmb_module_stage_cache.rds and skips any module whose sentinel + cached module_results are still present and whose genbank/db fingerprints have not changed:

[DNMB] Reusing existing module outputs for: dbAPIS, dbCAN, DefenseFinder,
  DefensePredictor, EggNOG, GapMind, ISelement, MEROPS, mRNAcal,
  PAZy, PhiSpy, REBASEfinder
[DNMB] Running remaining modules: AcrFinder, Promotech, PADLOC

Modules that produced 0 hits also cache-hit because the sentinel is written regardless of hit count. PromoTech additionally skips its Python prediction step when <workdir>/dnmb_module_promotech/promotech_runs/<contig>/genome_predictions.csv is newer than the contig FASTA, and prunes the multi-GB RF-HOT*.data / SEQS*.data intermediates after each successful run.
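The sentinel part of this check can be sketched in base R as below. This is only an illustration of the idea: the real reuse decision also involves .dnmb_module_stage_cache.rds and the genbank/db fingerprints, which this sketch ignores. The helper `modules_done()` is hypothetical, and the lower-case directory naming is an assumption based on the paths shown in this README.

```r
# Hypothetical helper: which modules already have a completion sentinel in workdir?
modules_done <- function(workdir, modules) {
  m <- tolower(modules)  # assumption: module directories use lower-case names
  sentinel <- file.path(workdir,
                        paste0("dnmb_module_", m),
                        paste0(".dnmb_", m, "_complete"))
  setNames(file.exists(sentinel), modules)
}

# Toy working directory in which only one module has finished.
workdir <- file.path(tempdir(), "dnmb_sentinel_demo")
dir.create(file.path(workdir, "dnmb_module_padloc"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(workdir, "dnmb_module_padloc", ".dnmb_padloc_complete")))

done <- modules_done(workdir, c("PADLOC", "PhiSpy"))
message("Reusing existing module outputs for: ", paste(names(done)[done], collapse = ", "))
message("Running remaining modules: ", paste(names(done)[!done], collapse = ", "))
```

Because the sentinel is an empty marker file, a module that found no hits is indistinguishable from one that found many, which matches the cache-hit behavior described above.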

Module outputs of note

| Module | Headline file | Where |
| --- | --- | --- |
| mRNAcal | `mRNAcal_translation_efficiency_summary.pdf` | `<workdir>/visualizations/` |
| mRNAcal (full per-gene table) | `translation_efficiency_summary.xlsx`, `translation_efficiency_distribution.pdf`, `translation_efficiency_rbs_detail.tsv` | `<workdir>/dnmb_module_mrnacal/` |
| ISelement | `ISelement_overview.pdf` | `<workdir>/visualizations/` |
| REBASEfinder | `REBASE_overview.pdf`, `R-M_REBASE_analysis.xlsx`, `DNMB_REBASEfinder_augmented_hits.tsv`, `DNMB_REBASEfinder_motif_hits.tsv`, `DNMB_REBASEfinder_structure_queries.faa`, `DNMB_REBASEfinder_structure_coverage.tsv` | `<workdir>/visualizations/`, `<workdir>/dnmb_module_rebasefinder/` |
| Combined Excel | `*_total.xlsx` (one row per locus_tag, every module's columns + per-module sheets, mRNAcal_full sheet, IS_census, Landing_pads, …) | `<workdir>/` |

REBASEfinder structural validation

REBASEfinder writes DNMB_REBASEfinder_structure_queries.faa for candidate R-M proteins and DNMB_REBASEfinder_structure_coverage.tsv to show which candidates actually have query structures and Foldseek hits. Queries without a structure are also written to DNMB_REBASEfinder_structure_missing_queries.faa; use that FASTA to fill the gaps before treating Foldseek validation as complete. You can predict structures for those candidates, compare them to the bundled R-M reference structures with Foldseek, and merge the resulting secondary evidence back into DNMB:

Rscript inst/scripts/rebasefinder_prepare_structure_refs.R \
  --out refs

Rscript inst/scripts/rebasefinder_fetch_alphafold_structures.R \
  --queries <workdir>/dnmb_module_rebasefinder/DNMB_REBASEfinder_structure_queries.faa \
  --metadata <workdir>/dnmb_module_rebasefinder/rebasefinder_input.tsv \
  --out-dir query_structures

Rscript inst/scripts/rebasefinder_esmfold_predict.R \
  --in <workdir>/dnmb_module_rebasefinder/DNMB_REBASEfinder_structure_queries.faa \
  --out-dir query_structures \
  --limit 50

Rscript inst/scripts/rebasefinder_foldseek_validate.R \
  --query query_structures \
  --target refs \
  --out <workdir>/dnmb_module_rebasefinder/foldseek_results.tsv \
  --threads 4

Rscript inst/scripts/rebasefinder_verify_motif_structures.R \
  --motifs <workdir>/dnmb_module_rebasefinder/DNMB_REBASEfinder_motif_hits.tsv \
  --structures-dir query_structures \
  --out <workdir>/dnmb_module_rebasefinder/DNMB_REBASEfinder_motif_structure_verification.tsv

Re-run REBASEfinder with rebasefinder_structure_tsv pointing to the Foldseek TSV, or leave foldseek_results.tsv in the REBASEfinder module directory for automatic pickup. Structure-supported calls are added to DNMB_REBASEfinder_augmented_hits.tsv and marked in REBASE_overview.pdf. Role-relevant motif regex hits, partial/short-sequence flags, and structural support status are written to DNMB_REBASEfinder_motif_hits.tsv; all raw regex hits, including role-inappropriate diagnostic matches, are written to DNMB_REBASEfinder_motif_hits_raw.tsv. When query structures are available, rebasefinder_verify_motif_structures.R checks each listed motif range for modeled-residue coverage, local pLDDT, and short-motif CA-span distance.

GPU-gated defaults (CLEAN and PIDE)

run_DNMB() probes for an NVIDIA GPU via nvidia-smi -L at call time. module_CLEAN and module_PIDE default to TRUE only when a CUDA device is detected; they default to FALSE otherwise. This keeps CPU-only runs fast (both modules use large neural models — CLEAN's LayerNormNet and PIDE's ESM-650M — and are ~50–100× slower without a GPU).

Force enable: run_DNMB(module_CLEAN = TRUE, module_PIDE = TRUE) or set Sys.setenv(DNMB_CUDA = "1") before the call.
Force disable: pass module_CLEAN = FALSE / module_PIDE = FALSE explicitly, or set DNMB_CUDA=0.
In the DNMBsuite Docker wrapper, the same probe runs on the host and automatically attaches --gpus all when CUDA is present.
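The default resolution described above can be sketched roughly as follows. This is a hypothetical reconstruction in base R and not the code used by run_DNMB(): the function name `gpu_default()` and the exact precedence between the DNMB_CUDA variable and the nvidia-smi probe are assumptions.

```r
# Hypothetical sketch: decide the default for module_CLEAN / module_PIDE.
gpu_default <- function(env = Sys.getenv("DNMB_CUDA", unset = "")) {
  if (env == "1") return(TRUE)     # forced on via DNMB_CUDA=1
  if (env == "0") return(FALSE)    # forced off via DNMB_CUDA=0

  # Otherwise probe for an NVIDIA GPU with `nvidia-smi -L`.
  smi <- unname(Sys.which("nvidia-smi"))
  if (!nzchar(smi)) return(FALSE)  # nvidia-smi is not on PATH
  out <- tryCatch(
    suppressWarnings(system2(smi, "-L", stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
  # nvidia-smi -L prints one line per device, such as "GPU 0: ...".
  any(grepl("^GPU [0-9]+:", out))
}

use_gpu_modules <- gpu_default()
message("CLEAN/PIDE default: ", use_gpu_modules)
```

An explicit module_CLEAN or module_PIDE argument would still take priority over whatever this kind of probe returns.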

With comparative heatmaps across sibling genomes

Passing comparative = TRUE runs the single-genome pipeline as usual and, at the end, renders the full suite of comparative heatmaps across every sibling folder that holds a GenBank file. By default the parent directory of getwd() is scanned; pass comparative_data_root to point elsewhere.

setwd([per-genome directory])
library(DNMB)
run_DNMB(comparative = TRUE)
# or: run_DNMB(comparative = TRUE, comparative_data_root = "/path/to/parent")

The comparative stage renders 14 heatmaps into <data_root>/comparative/:

| Module | Plotter | Subtype axis |
| --- | --- | --- |
| DefenseFinder | `dnmb_plot_comparative_defensefinder` | system |
| PADLOC | `dnmb_plot_comparative_padloc` | system |
| DefensePredictor | `dnmb_plot_comparative_defensepredictor` | system |
| REBASEfinder | `dnmb_plot_comparative_rebasefinder` | enzyme type |
| MEROPS | `dnmb_plot_comparative_merops` | family (C26, S8, …) |
| MEROPS | `dnmb_plot_comparative_merops_catalytic` | catalytic type (Cysteine, Serine, …) |
| dbCAN | `dnmb_plot_comparative_dbcan` | class (GH, GT, PL, …) |
| dbCAN | `dnmb_plot_comparative_dbcan_family` | family (GH13, GT2, …) |
| CGC | `dnmb_plot_comparative_cgc` | signature mix (CAZyme+TC+TF, …) |
| CGC | `dnmb_plot_comparative_cgc_substrate` | predicted substrate |
| PAZy | `dnmb_plot_comparative_pazy` | family |
| PhiSpy | `dnmb_plot_comparative_phispy` | region size bucket |
| VirSorter2 | `dnmb_plot_comparative_virsorter2` | max_score_group (dsDNAphage, ssDNA, …) |
| PIDE | `dnmb_plot_comparative_pide` | region size bucket |

Each plotter auto-runs its own module on any genome that has not been analyzed yet, so a fresh sibling folder just needs a GenBank file.

DefenseFinderViz (Optional)

DefenseFinder_Heatmap()

Note: The [Strain of interest].defense_finder_systems.tsv output is used for merging data. The GenBank SOURCE field is used to extract strain names.
Note: The protein coding sequence (.faa) output is used as input for defense-finder analysis (https://github.com/mdmparis/defense-finder)

Comparative per-module heatmaps across genomes (Optional)

The easiest entry point is run_DNMB(comparative = TRUE) from any per-genome folder — the per-genome analysis runs as usual and the full comparative suite renders against the parent directory at the end. The individual plotters below are useful when you want to render only a subset, override colors, or point at a non-sibling parent directory.

Point each plotter at a parent directory containing one subfolder per genome. Every subfolder that holds a GenBank file (*.gbff / *.gbk / *.gb) is treated as a genome. Genomes missing the relevant module output are analyzed on the fly (auto_run_missing = TRUE, default) — each plotter triggers only its own module via run_module_set(db = ...), not the full DNMB pipeline. Genomes whose module has already run are read from disk; genomes that truly have no hits still render as empty rows so "analyzed, empty" is distinguishable from "not yet analyzed".

library(DNMB)

data_root <- "/path/to/parent-dir-of-genome-folders"

# Defense-system heatmaps (purple palette)
dnmb_plot_comparative_defensefinder(data_root)    # DefenseFinder
dnmb_plot_comparative_padloc(data_root)           # PADLOC
dnmb_plot_comparative_defensepredictor(data_root) # DefensePredictor
dnmb_plot_comparative_rebasefinder(data_root)     # REBASEfinder

# Enzyme / CAZyme heatmaps (module-specific palettes)
dnmb_plot_comparative_merops(data_root)             # MEROPS family (C26, S8, …)
dnmb_plot_comparative_merops_catalytic(data_root)   # MEROPS catalytic type (Cysteine, Serine, …)
dnmb_plot_comparative_dbcan(data_root)              # dbCAN class (GH, GT, PL, …)
dnmb_plot_comparative_dbcan_family(data_root)       # dbCAN family (GH13, GT2, …)
dnmb_plot_comparative_cgc(data_root)                # CGC signature mix (CAZyme+TC+TF, …)
dnmb_plot_comparative_cgc_substrate(data_root)      # CGC substrate (starch, melibiose, …)
dnmb_plot_comparative_pazy(data_root)               # PAZy families

# Prophage heatmaps (purple palette)
dnmb_plot_comparative_phispy(data_root)     # PhiSpy regions bucketed by size
dnmb_plot_comparative_virsorter2(data_root) # VirSorter2 max_score_group
dnmb_plot_comparative_pide(data_root)       # PIDE regions bucketed by size

Outputs are written under <data_root>/comparative/ as Comparative_<Module>_Heatmap.pdf alongside the underlying count matrix.

Pass auto_run_missing = FALSE to skip the on-the-fly analysis and only render what already exists.
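For orientation, the folder layout the plotters expect can be checked with a few lines of base R. The sketch below lists the immediate subfolders of a parent directory that contain a GenBank file; the helper `find_genome_dirs()` and the toy folder names are hypothetical, and the matching rules inside DNMB may differ in detail.

```r
# Hypothetical helper: immediate subfolders of data_root that hold a GenBank file.
find_genome_dirs <- function(data_root) {
  dirs <- list.dirs(data_root, full.names = TRUE, recursive = FALSE)
  has_gb <- vapply(dirs, function(d) {
    length(list.files(d, pattern = "\\.(gbff|gbk|gb)$")) > 0
  }, logical(1))
  basename(dirs[has_gb])
}

# Toy layout: strainA holds a GenBank file, strainB does not.
data_root <- file.path(tempdir(), "dnmb_comparative_demo")
dir.create(file.path(data_root, "strainA"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(data_root, "strainB"), showWarnings = FALSE)
invisible(file.create(file.path(data_root, "strainA", "genome.gbff")))
invisible(file.create(file.path(data_root, "strainB", "notes.txt")))

genomes <- find_genome_dirs(data_root)
print(genomes)
```

Running a check like this before calling a plotter shows which folders would be picked up as genomes and which would be ignored.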

EggNOG-mapper (Optional)

emapper.py --cpu 20 --mp_start_method forkserver --data_dir [eggnog_data directory] -o out --output_dir [eggnog_output] --temp_dir [eggnog_output] --override -m diamond --dmnd_ignore_warnings --dmnd_algo ctg -i [fasta] --evalue 0.001 --score 60 --pident 40 --query_cover 20 --subject_cover 20 --itype proteins --tax_scope auto --target_orthologs all --go_evidence non-electronic --pfam_realign none --report_orthologs --decorate_gff yes --excel

Note: The [Strain of interest].emapper.annotations.xlsx or [Strain of interest].emapper.annotations.csv output is used for merging data.

InterProScan (Optional)

./interproscan.sh -i [input_file] -f tsv -iprlookup -etra -goterms -pa -cpu 20

Note: Files with [Strain of interest].tsv and [Strain of interest].tsv.sites extensions are used for merging data.

Promotech (Optional)

https://github.com/BioinformaticsLabAtMUN/PromoTech

Promotech can be appended as a DNMB module. It is disabled by default because the upstream RF-HOT/RF-TETRA models are large and are not bundled with DNMB. DNMB caches Promotech runtime files under the module cache (DNMB_CACHE_ROOT or ~/.dnmb-cache). If live prediction is requested without a precomputed predictions file, the selected model is downloaded into that same cache unless promotech_download_model = FALSE. Each run writes dnmb_module_promotech/promotech_promoter_feature_for_gb for copying into a GenBank FEATURES block and, when a GenBank input is available, dnmb_module_promotech/promotech_promoters_annotated.gbk for SnapGene import. Promoter feature labels include both the stable Promotech id and score, for example Promotech_000001 (score=0.62883).

# Import a precomputed Promotech genome_predictions.csv/TSV file
run_DNMB(
  module_Promotech = TRUE,
  promotech_predictions = "genome_predictions.csv",
  promotech_threshold = 0.6
)

# Or run only the Promotech module against the active genbank_table
run_module_set(
  db = "Promotech",
  promotech_predictions = "genome_predictions.csv",
  merge = TRUE
)

# Live prediction: caches the Promotech repo and selected model first
run_DNMB(
  module_Promotech = TRUE,
  promotech_model = "RF-HOT",
  promotech_threshold = 0.6
)

Note: Live Promotech execution requires the upstream Python dependencies and enough RAM. Precomputed genome_predictions.csv import works without downloading the model or running Promotech itself.
Runtime: RF-HOT live prediction is intentionally heavy because upstream Promotech scans every 40-nt window on both strands. A 2.36 Mb bacterial genome took about 26 minutes and roughly 12 GB RAM in Docker during local validation; DNMB promoter-to-gene mapping and GenBank/SnapGene artifact generation then completed in about 15 seconds.
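To illustrate how a score threshold and the feature label format described above fit together, here is a small base R sketch on made-up predictions. The column names, the coordinates, and the assumption that ids are numbered after filtering are hypothetical; only the label pattern follows the example given in this section.

```r
# Toy predictions with made-up coordinates and scores; column names are hypothetical.
pred <- data.frame(
  start = c(120, 2450, 9800),
  end   = c(159, 2489, 9839),   # 40-nt windows
  score = c(0.62883, 0.41, 0.9312)
)

# Keep predictions at or above the threshold.
threshold <- 0.6
kept <- pred[pred$score >= threshold, ]

# Build labels of the form "Promotech_000001 (score=0.62883)".
kept$label <- sprintf("Promotech_%06d (score=%.5f)", seq_len(nrow(kept)), kept$score)
print(kept)
```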

mRNAcal / Translation Initiation (Optional)

mRNAcal can be appended as a DNMB module with module_mRNAcal = TRUE. It scores each protein-coding gene's translation initiation region (TIR) using RBS motif quality, RBS-to-start spacing, anti-Shine-Dalgarno duplex energy, RNAplfold local accessibility, upstream A/U-rich enhancer context, start codon, N-terminal coding sequence context (including early lysine/AAA-AAG codons and the first 45 bp after ATG), and RNAfold MFE. The default extraction window is 60 nt upstream and 60 nt downstream of the start codon in the transcript direction, but RNAplfold is run on a smaller local TIR window around the start codon for speed and biological specificity.

DNMB uses ViennaRNA RNAfold for fast batch MFE folding, RNAplfold for RBS, start, standby-site, and start-centered TIR unpaired probabilities, and RNAduplex for anti-SD:RBS binding energy. Docker builds ViennaRNA from source (VIENNARNA_VERSION=2.7.2 by default), so these tools are available on PATH; local runs can pass mrnacal_rnafold_path = "/path/to/RNAfold" when needed, and sibling RNAplfold/RNAduplex binaries are detected automatically. Results are written to dnmb_module_mrnacal/mrnacal_translation_efficiency.tsv, appended to DNMB_table.xlsx with the mRNAcal_ prefix, and visualized under visualizations/ as mRNAcal_translation_efficiency.pdf plus top fold arc diagrams in mRNAcal_top_folds.pdf.

run_DNMB(module_mRNAcal = TRUE)

run_module_set(
  db = "mRNAcal",
  mrnacal_upstream = 60,
  mrnacal_downstream = 60,
  merge = TRUE
)

Note: If a 16S/SSU rRNA annotation is present, DNMB derives the Shine-Dalgarno seed from its 3' tail; otherwise it falls back to AGGAGG for bacteria and GGAGG for archaea. Use mrnacal_sd_seed = "AGGAGG" to force a specific motif.
Scoring: The composite score prioritizes local TIR accessibility: 0.20*RBS + 0.18*antiSD + 0.32*RNAplfold local accessibility + 0.10*upstream A/U + 0.10*start + 0.07*early coding context + 0.03*MFE. The early coding-context term follows the N-terminal coding sequence idea of Tian et al. (doi:10.1016/j.ymben.2019.07.001) and reports both codon 2-8 lysine signals and NCS45 features. Long homopolymeric AAA/poly-A runs are penalized rather than blindly rewarded.
Runtime: The module folds short TIR windows in ViennaRNA batches, so it is much lighter than full transcriptome folding.
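The weighted sum in the Scoring note can be written out as a short base R sketch. The weights are those listed above; the component names, the helper `composite_score()`, and the assumption that every component is already scaled to the range 0 to 1 are hypothetical, and the example values are made up.

```r
# Weights as listed in the Scoring note; the names are hypothetical labels.
mrnacal_weights <- c(
  rbs = 0.20, anti_sd = 0.18, accessibility = 0.32,
  upstream_au = 0.10, start = 0.10, early_coding = 0.07, mfe = 0.03
)

# Hypothetical helper: weighted sum of per-gene component scores.
composite_score <- function(components, weights = mrnacal_weights) {
  stopifnot(setequal(names(components), names(weights)))
  sum(weights * components[names(weights)])
}

# Made-up component scores for one gene (assumed to lie between 0 and 1).
example <- c(
  rbs = 0.8, anti_sd = 0.6, accessibility = 0.9,
  upstream_au = 0.5, start = 1.0, early_coding = 0.4, mfe = 0.7
)
composite_score(example)  # 0.755
```

Because the weights sum to 1, a gene that scores 1 on every component gets a composite score of 1, and the local accessibility term alone contributes about a third of the total.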

Contributing

We welcome contributions from the community! If you have suggestions for improvements, additional scripts, or updates to the database, please see our contributing guidelines for more information on how to get involved.

License

This project is released under the MIT License, which allows for both personal and commercial use, modification, and distribution of our work, provided that proper credit is given.

We hope our resources will prove invaluable to your research in systems biology. For any questions or feedback, please don't hesitate to reach out through our GitHub issues or contact section.

Citation

If you use this pipeline, please cite:

  [DNMB]                DNMB: Programmable domestication of thermophilic bacteria through removal of non-canonical defense systems.
                        Sung, J.Y., Lee, M.H., Park, J.S., Kim, H.B., Ganbat, D., Kim, D.G., Cho, H.W., Suh, M.K., Lee, J.S., Lee, S.J., Kim, S.B.*, and Lee, D.W.*.
                        *bioRxiv* 2026.03.21.173436. (2026)

Please also cite the underlying algorithm/database if it was used for the search step of DNMB:

  [EggNOG-mapper v2]    eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at
                        the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza,
                        Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution
                        38(12):5825-5829. https://doi.org/10.1093/molbev/msab293

  [CLEAN]               Enzyme function prediction using contrastive learning. Tianhao Yu, Haiyang Cui, Jianan Canal Li,
                        Yunan Luo, Guangde Jiang, Huimin Zhao. 2023. Science 379(6639):1358-1363. 
                        https://doi.org/10.1126/science.adf2465

  [InterProScan]        InterProScan 5: genome-scale protein function classification.
                        Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla,
                        Hamish McWilliam, John Maslen, Alex Mitchell, Gift Nuka, Sebastien Pesseat, Antony F. Quinn,
                        Amaia Sangrador-Vegas, Maxim Scheremetjew, Siew-Yit Yong, Rodrigo Lopez, Sarah Hunter.
                        2014. Bioinformatics 30(9):1236-1240. https://doi.org/10.1093/bioinformatics/btu031

  [DefenseFinder]       DefenseFinder: Systematic and quantitative view of the antiviral arsenal of prokaryotes.
                        Florian Tesson, Alexandre Herve, Ernest Mordret, Marie Touchon, Camille d'Humieres, Jean Cury,
                        Aude Bernheim. 2022. Nature Communications 13:2561. https://doi.org/10.1038/s41467-022-30269-9

  [REBASE]              REBASE-a database for DNA restriction and modification: enzymes, genes and genomes.
                        Richard J. Roberts, Tamas Vincze, Janos Posfai, Dana Macelis. 2010. Nucleic Acids Research
                        38(Database issue):D234-D236. https://doi.org/10.1093/nar/gkp874

  [GapMindAA]           GapMind: Automated annotation of amino acid biosynthesis.
                        Morgan N. Price, Adam M. Deutschbauer, Adam P. Arkin. 2020. mSystems 5(3):e00291-20.
                        https://doi.org/10.1128/mSystems.00291-20

  [GapMindCarbon]       Filling gaps in bacterial catabolic pathways with computation and high-throughput genetics.
                        Morgan N. Price, Adam M. Deutschbauer, Adam P. Arkin. 2022. PLoS Genetics 18(4):e1010156.
                        https://doi.org/10.1371/journal.pgen.1010156

  [MEROPS]              The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a
                        comparison with peptidases in the PANTHER database. Neil D. Rawlings, Alan J. Barrett,
                        Paul D. Thomas, Xiaosong Huang, Alex Bateman, Robert D. Finn. 2018. Nucleic Acids Research
                        46(D1):D624-D632. https://doi.org/10.1093/nar/gkx1134

  [dbCAN]               dbCAN3: automated carbohydrate-active enzyme and substrate annotation.
                        Jinfang Zheng, Qiwei Ge, Yuchen Yan, Xinpeng Zhang, Le Huang, Yanbin Yin. 2023.
                        Nucleic Acids Research 51(W1):W115-W121. https://doi.org/10.1093/nar/gkad328

  [PAZy]                Plastics degradation by hydrolytic enzymes: The plastics-active enzymes database-PAZy.
                        Patrick C. F. Buchholz, Golo Feuerriegel, Hongli Zhang, Pablo Perez-Garcia,
                        Lena-Luisa Nover, Jennifer Chow, Wolfgang R. Streit, Jurgen Pleiss. 2022.
                        Proteins 90(7):1443-1456. https://doi.org/10.1002/prot.26325

  [ISelement]           ISEScan: automated identification of insertion sequence elements in prokaryotic genomes.
                        Zhiqun Xie, Haixu Tang. 2017. Bioinformatics 33(21):3340-3347.
                        https://doi.org/10.1093/bioinformatics/btx433

                        ISfinder: the reference centre for bacterial insertion sequences.
                        Philippe Siguier, Jerome Perochon, Lucie Lestrade, Jacques Mahillon,
                        Michael Chandler. 2006. Nucleic Acids Research 34(Database issue):D32-D36.
                        https://doi.org/10.1093/nar/gkj014

  [PhiSpy]              PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity-
                        and composition-based strategies. Sajia Akhter, Rashedul Aziz,
                        Robert A. Edwards. 2012. Nucleic Acids Research 40(16):e126. https://doi.org/10.1093/nar/gks406

  [VirSorter2]          VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses.
                        Jiarong Guo, Ben Bolduc, Ahmed A Zayed, Arvind Varsani, Guillermo Dominguez-Huerta, Tom O Delmont,
                        Akbar Adjie Pratama, M Consuelo Gazitúa, Dean Vik, Matthew B Sullivan, Simon Roux. 2021. Microbiome 9:37.
                        https://doi.org/10.1186/s40168-020-00990-y

  [PIDE]                PIDE: a deep learning-based tool for prophage identification using genome-wide features.
                        https://github.com/BackofenLab/PIDE
