This feature reduces the run time of EXTRACT_HUMAN_VIRUS_READS by replacing aligns_to with deacon filter. For the resulting database to behave like STAT's exact k-mer matching, the index must be built with a window size of 1.
Meat and potatoes processes to replace EXTRACT_HUMAN_VIRUS_READS
DEACON_BUILD_INDEX_FROM_STAT_K_MERS was incorrectly wired as a
per-sample process with a 9-element tuple input containing reads,
reference files, and an existing deacon index. The index build only
needs the three STAT reference files (dbss, annotation, taxids) and
produces identical output for every sample, so it should run once.
Rewrites:
- DEACON_BUILD_INDEX_FROM_STAT_K_MERS: takes 3 separate path inputs
(stat_dbss, stat_annotation, human_virus_taxlist), runs once per
pipeline invocation, outputs human_viruses.k31w1.idx. References
stat_to_deacon.rs via ${projectDir}/bin/ for container compatibility.
- DEACON_FILTER_HUMAN_VIRUS_READS: simplified from 9-element tuple to
5-element tuple (sample_id, platform, read_structure, reads, index).
Accepts interleaved input directly — deacon handles this natively,
eliminating the need for a separate split step. Fixed missing quotes
on path() output declaration and removed shell comment inside a line
continuation that would have broken the command.
This process was added to split interleaved reads into R1/R2 before deacon filtering. Deacon handles interleaved FASTQ input natively, so the split step is unnecessary overhead — it adds an extra process, extra disk I/O, and intermediate files without benefit. The rewritten DEACON_FILTER_HUMAN_VIRUS_READS accepts interleaved reads directly.
EXTRACT_HUMAN_VIRUS_READS used STAT aligns_to + seqkit grep to identify human virus reads, taking 8-17 hours per sample. This replaces it with two deacon-based processes validated across 35 samples at 100% sensitivity and 60-80x speedup (~8-15 minutes per sample).
The subworkflow now:
1. Builds a deacon index from STAT k-mers (once, not per-sample) via DEACON_BUILD_INDEX_FROM_STAT_K_MERS
2. Filters reads per-sample via DEACON_FILTER_HUMAN_VIRUS_READS using the shared index, combined with each sample's reads via .combine()
The output tuple shape is unchanged, tuple(sample_id, platform, read_structure, fastq), so RUN_SPADES, MAP_READS_TO_CONTIGS, and the viral_reads emit all work without modification.
Also updates conf/results.config to publish DEACON_FILTER_HUMAN_VIRUS_READS output to the same human_virus_reads directory.
rust-script is not available for aarch.
…US_READS
Deacon's pair-aware mode requires interleaved input via stdin (- -)
to count k-mer hits from both mates jointly. When given a single file
positionally, deacon treats it as single-end. The process now branches
on read_structure to use the correct invocation:
- Interleaved (Illumina): zcat | deacon filter ... - -
Pair-aware k-mer counting where hits from either mate contribute
to the filtering threshold for the read pair.
- Single-end (ONT): deacon filter ... ${reads}
Standard single-read filtering with no pairing logic.
This ensures correct sensitivity for both sequencing platforms when
virus extraction is frontloaded before preprocessing.
Virus read extraction now happens upstream in STAT_BLAST_WORKFLOW via DEACON_FILTER_HUMAN_VIRUS_READS, so PREPROCESS_CONTIGS no longer needs to perform extraction. Removes deacon process imports and the three STAT reference file take channels (ch_stat_dbss, ch_stat_annotation, ch_human_virus_taxlist). The subworkflow now takes a single channel of already-filtered and preprocessed reads and runs: RUN_SPADES -> MASK_LOW_COMPLEXITY -> FILTER_SHORT_CONTIGS. The viral_reads emit passes through the incoming reads directly since they ARE the viral reads.
Moves virus read extraction to the very start of the STAT_BLAST pipeline, running immediately after interleaving and before any preprocessing. This reduces the dataset from full metagenomic reads (~100% of data) to virus-only reads (~1-3%) before dedup, trimming, and filtering, saving ~5 hours per sample.
The new pipeline order within STAT_BLAST_WORKFLOW:
1. DEACON_BUILD_INDEX_FROM_STAT_K_MERS (once, from STAT .dbss)
2. DEACON_FILTER_HUMAN_VIRUS_READS (per sample, ~30 min)
3. PREPROCESS_VIRUS_READS (aliased PREPROCESS_READS on tiny subset)
4. PREPROCESS_CONTIGS (SPAdes + mask + filter, no extraction)
5. EXTRACT_HUMAN_VIRUSES + BLAST classification
The aliased PREPROCESS_READS import (as PREPROCESS_VIRUS_READS) allows main.nf to call its own PREPROCESS_READS instance for the GOTTCHA2 path without Nextflow DSL2 invocation conflicts.
STAT_BLAST_WORKFLOW now receives raw gathered reads instead of preprocessed reads, since it handles its own virus extraction and preprocessing internally on the small virus subset. GOTTCHA2 preprocessing is gated behind STAT_BLAST completion via a combine/map signal pattern. This frees cluster resources from the fast STAT_BLAST path (~30 min virus extraction + fast preprocessing) before the expensive full-dataset preprocessing starts for GOTTCHA2.
Pipeline execution order:
1. GATHER_READS (interleave): shared by all paths
2. STAT_BLAST_WORKFLOW (immediate, virus subset): deacon → preprocess → SPAdes → BLAST
3. PREPROCESS_READS (waits for STAT_BLAST): full dataset for GOTTCHA2
4. GOTTCHA2_WORKFLOW (after preprocessing completes)
5. CLUMPIFY_WORKFLOW (waits for both STAT_BLAST and GOTTCHA2)
SPAdes assembles virus-only contigs from a small filtered read subset, not full metagenomic datasets. The ludicrous tier (450GB+ RAM) is unnecessary. The high label (20 CPUs, 64GB RAM) is sufficient for de novo assembly of extracted virus reads.
Deacon now takes R1/R2 as separate files for paired reads and outputs interleaved FASTQ in a single pass, combining virus filtering and interleaving into one step. This eliminates the ~1 hour INTERLEAVE_PAIRS bottleneck for the STAT_BLAST path. Input is now tuple(sample_id, platform, reads, reads2, deacon_idx) where reads2 is a sentinel file "NO_R2" for single-end (ONT) samples. The process determines read_structure at runtime by checking reads2.name and branches: paired passes R1 R2 as positional args to deacon, single passes the file directly. Output tuple is the standard 4-element shape (sample_id, platform, read_structure, fastq) for PREPROCESS_READS.
…AIRS
GATHER_READS now emits two channels:
- ch_gathered_reads: interleaved reads (for GOTTCHA2 and CLUMPIFY)
- ch_pre_interleave: raw R1/R2 tuples (for STAT_BLAST's deacon filter)
INTERLEAVE_PAIRS only runs when gottcha2 or clumpify is selected. When only stat_blast is selected, interleaving is skipped entirely: deacon handles R1/R2 directly and outputs interleaved as a byproduct. This avoids wasting ~1 hour of compute on interleaving that the STAT path does not consume. Uses NvdUtils.isToolSelected() to gate the interleave step.
The pre-interleave channel has mixed tuple sizes: paired reads emit (id, platform, R1, R2) while singles emit (id, platform, fastq). Downstream normalization happens in STAT_BLAST_WORKFLOW.
COUNT_READS now accepts the same R1/R2-or-sentinel tuple pattern as DEACON_FILTER_HUMAN_VIRUS_READS: (sample_id, platform, reads, reads2) where reads2 is "NO_R2" for single-end samples. For paired samples, both R1 and R2 are counted and summed to produce the total read count — matching the result previously obtained from counting a single interleaved file. For single-end, counts the one file as before. This maintains backwards-compatible read counts for LabKey reporting.
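The paired/single counting rule above can be sketched in Python. This is only an illustration of the logic, not the process script: it assumes gzip-compressed FASTQ with exactly four lines per record, and the function names are hypothetical.

```python
import gzip
from pathlib import Path

NO_R2 = "NO_R2"  # sentinel file name described in the commit message


def count_fastq_records(path: Path) -> int:
    """Count records in a gzipped FASTQ, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def total_reads(reads: Path, reads2: Path) -> int:
    """Sum R1 and R2 for paired samples; count one file for single-end."""
    total = count_fastq_records(reads)
    if reads2.name != NO_R2:  # the sentinel is never opened
        total += count_fastq_records(reads2)
    return total
```

Summing R1 and R2 gives the same total as counting one interleaved file, which is why the reported counts stay backwards compatible.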
STAT_BLAST_WORKFLOW now takes pre-interleave tuples directly from GATHER_READS instead of interleaved reads. Mixed-size tuples are normalized with a sentinel file (NO_R2) so downstream processes get a fixed 4-element tuple regardless of platform:
- Paired: (id, platform, R1, R2)
- Single: (id, platform, fastq, NO_R2)
COUNT_READS receives the full tuple and counts both R1+R2 for paired samples. The deacon filter step now runs before any interleaving or preprocessing, saving ~5 hours per sample (1hr interleave + 4hr dedup on full data).
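A minimal Python sketch of the sentinel normalization described above. The real logic lives in Nextflow channel operators; the function names and the "paired"/"single" labels here are assumptions made for illustration.

```python
from typing import Tuple

NO_R2 = "NO_R2"  # sentinel name from the commit message


def normalize_reads_tuple(record: tuple) -> Tuple[str, str, str, str]:
    """Pad a 3-element single-end tuple to the fixed 4-element shape."""
    if len(record) == 4:
        return tuple(record)  # paired: (id, platform, R1, R2)
    if len(record) == 3:
        sample_id, platform, fastq = record
        return (sample_id, platform, fastq, NO_R2)
    raise ValueError(f"unexpected tuple size: {len(record)}")


def read_layout(reads2_name: str) -> str:
    """Branch on the sentinel name (labels are hypothetical)."""
    return "single" if reads2_name == NO_R2 else "paired"
```

With a fixed arity, every downstream process can declare one input signature and decide at runtime whether the fourth element is a real file.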
STAT_BLAST_WORKFLOW receives GATHER_READS.out.ch_pre_interleave (raw R1/R2) so deacon can filter and interleave in one step. GOTTCHA2 and CLUMPIFY receive GATHER_READS.out.ch_gathered_reads (interleaved) as before. Named emit access (GATHER_READS.out.channel_name) replaces the previous unnamed single-emit pattern.
Removes GOTTCHA2, clumpify, PREPROCESS_READS imports and all gating logic. Pipeline is now: samplesheet -> GATHER_READS -> STAT_BLAST_WORKFLOW. No tool selection, no forked channels, no completion tokens.
Removes INTERLEAVE_PAIRS, MERGE_PAIRS, needs_interleave conditional, ch_gathered_reads, and dual emit channels. Emits raw R1/R2 tuples only. Deacon handles interleaving as a byproduct of virus filtering.
…ction
STAT_BLAST_WORKFLOW owns its full pipeline: deacon extraction -> inlined preprocessing -> SPAdes -> BLAST. Preprocessing steps have direct param guards (params.dedup, params.trim_adapters, etc.). Removes isToolSelected gating, ch_blast_enabled signal channel, blast_selected conditionals, and the aliased PREPROCESS_READS import. Also removes when: params.tools guard from FETCH_FASTQ in sratools.nf.
Removes isToolSelected(), BLAST_ALIASES, LABKEY_GOTTCHA2_PARAMS, and validateLabkeyGottcha2(). Keeps validateLabkeyBlast() and the shared LabKey validation helper.
Removes tools, gottcha2_db, gottcha2_db_version, min_gottcha_reads, labkey_gottcha_* params, sra_human_db, and human_read_scrub.
Removes gottcha2_results, clumpify_results path params and their publishDir entries (CLUMP_READS, SCRUB_HUMAN_READS, LABKEY_UPLOAD_GOTTCHA2_FASTA, LABKEY_UPLOAD_GOTTCHA2_FULL).
Removes tools, gottcha2_db, min_gottcha_reads, and labkey_gottcha_* from NvdParams model and CLI options. Updates DbType, UploadType, and HitSource literals to remove gottcha2 variants. Cleans state.py database resolution and schema.sql CHECK constraints.
Removes tools, gottcha2_db, gottcha2_db_version, min_gottcha_reads, labkey_gottcha_*, sra_human_db, human_read_scrub properties. Renames schema file to v3.0.0 to match the major version bump.
Removes the GOTTCHA2 workflow, its LabKey integration, hit registration, and associated bin scripts. GOTTCHA2 will be spun off into its own standalone Nextflow workflow.
Removes clumpify post-processing workflow, CLUMP_READS from bbmap.nf, and three dead processes from stat.nf (EXTRACT_HUMAN_VIRUS_READS, SCRUB_HOST_READS, SCRUB_HUMAN_READS) replaced by deacon equivalents.
…ion subworkflow
v2.4.0 schema is obsolete. preprocess_reads.nf logic is inlined into stat_blast_workflow.nf. host_depletion.nf index resolution logic is inlined where needed.
The tools field was removed from NvdParams but the @field_validator for it remained, causing a pydantic PydanticUserError on import.
DEACON_BUILD_INDEX is not used anywhere and is now removed.
With the update to v3.0, we hope to have no processes requiring more than the high process label.
Adds --summary flag to produce a structured JSON with seqs_in, seqs_out, bp_in, bp_out, and timing stats. Emits the JSON as a named output (stats) alongside the filtered reads (reads). The seqs_in field provides total input read counts, enabling downstream extraction without a separate counting step.
Removes the COUNT_READS process call (~2 hours decompressing full R1+R2) and extracts total read counts from the deacon filter summary JSON via groovy.json.JsonSlurper. The seqs_in field provides the same tuple(sample_id, total_reads) shape that BUNDLE_BLAST_FOR_LABKEY expects, at zero additional compute cost.
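The commit does this extraction in Groovy with JsonSlurper. The same idea in Python might look like the sketch below, assuming the summary is a flat JSON object with seqs_in at the top level; the exact layout of deacon's summary file is not shown here, and the function name is hypothetical.

```python
import json
from typing import Tuple


def total_reads_from_summary(sample_id: str, summary_text: str) -> Tuple[str, int]:
    """Return (sample_id, total_reads) from a deacon summary JSON string.

    Assumes a top-level "seqs_in" key, as named in the commit message.
    """
    summary = json.loads(summary_text)
    return (sample_id, int(summary["seqs_in"]))
```

Because the filter already reads every input record, reusing its count avoids a second full pass over the raw FASTQ files.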
Appends a total_reads column to the final merged BLAST TSV so read counts are always present in published results regardless of whether LabKey is enabled. Uses awk to add the column after the LCA annotation step. Takes tuple(sample_id, blast_tsv, total_reads) and outputs the same filename with the new column.
Inserts the read count annotation step between CLASSIFY_WITH_BLASTN and downstream consumers (REGISTER_HITS, BUNDLE_BLAST_FOR_LABKEY). All downstream references now use ADD_READ_COUNTS_TO_BLAST.out instead of CLASSIFY_WITH_BLASTN.out.merged_results. Adds publishDir entry so the final TSV with total_reads lands in blast_merged.
The input and output filenames were both ${sample_id}_blast.merged_with_lca.tsv.
Nextflow stages the input as a symlink with that name, and the shell
redirect truncates it to 0 bytes before awk can read it. Renames output
to ${sample_id}_blast.final.tsv to avoid the collision.
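The collision and its fix can be illustrated in Python. The pipeline itself uses awk, so this is a sketch of the same step rather than the actual script; it assumes the TSV's first line is a header, and the function name is hypothetical.

```python
from pathlib import Path


def add_total_reads_column(blast_tsv: Path, out_tsv: Path, total_reads: int) -> None:
    """Append a total_reads column, writing to a different file than the input."""
    # resolve() follows symlinks, so a staged symlink that points at the
    # output path is caught here instead of being truncated on open.
    if blast_tsv.resolve() == out_tsv.resolve():
        raise ValueError("output path must differ from input path")
    with blast_tsv.open() as src, out_tsv.open("w") as dst:
        for i, line in enumerate(src):
            value = "total_reads" if i == 0 else str(total_reads)
            dst.write(f"{line.rstrip(chr(10))}\t{value}\n")
```

Opening a file for writing truncates it immediately, which is the same thing the shell redirect did to the staged input before awk could read it.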
…d GOTTCHA2 helpers
This commit makes the v3 configuration contract match the pipeline we actually intend to ship. Earlier versions accumulated parameters for GOTTCHA2, STAT-based host scrubbing, pair merging, and other optional branches that made the command surface look broader than the now-focused STAT+BLAST workflow. For a major release, keeping those knobs around would be worse than breaking compatibility: users would reasonably assume the parameters still controlled supported behavior.
The public parameter surface is now synchronized across nextflow.config, the v3 JSON schema, the Pydantic NvdParams model, the Typer CLI, preset registration, validation helpers, and generated params templates. Removed parameters include merge_pairs, scrub_host_reads, sra_human_db, human_read_scrub, and repair_pairs. The schema metadata and docs now point at nvd-params.v3.0.0, and the generated YAML template no longer suggests removed tools or GOTTCHA-era settings.
Deduplication remains central to the pipeline, but its semantics are simpler. The umbrella --dedup flag enables both sequence-based clumpify deduplication and positional samtools markdup deduplication, while --dedup-seq and --dedup-pos allow users to enable either mode independently. There are no negative counterpart flags for these options, avoiding contradictory states such as enabling the umbrella dedup flag while disabling one of its intended components.
Host depletion is also reframed around what users care about rather than the implementation detail. The old deacon_* names become semantic host_* parameters, and host depletion is off by default. Users opt in by providing a local host index, a host index URL, or a custom contaminant FASTA. Custom FASTA inputs now build an index with the configured k-mer and window sizes, and single-index resolution uses a symlink instead of copying potentially large index files.
The release metadata moves to 3.0.0 as part of the same contract change: pyproject.toml, the default container image, schema URLs, the pipeline fingerprint, and the uv/pixi locks are refreshed together. The GOTTCHA2 Conda dependency is removed so the environment no longer pays for a feature that v3 no longer exposes.
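The dedup flag semantics described in this commit reduce to two OR operations. The sketch below is an illustration of that rule only; the class and function names are hypothetical and are not taken from the NvdParams model or the Typer CLI.

```python
from typing import NamedTuple


class DedupPlan(NamedTuple):
    sequence: bool    # clumpify sequence-based deduplication
    positional: bool  # samtools markdup positional deduplication


def resolve_dedup(dedup: bool = False, dedup_seq: bool = False,
                  dedup_pos: bool = False) -> DedupPlan:
    """The umbrella flag turns on both modes; each specific flag turns on one."""
    return DedupPlan(sequence=dedup or dedup_seq,
                     positional=dedup or dedup_pos)
```

Since every flag can only enable a mode, no combination of inputs can describe a contradictory state.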
The CLI no longer hashes main.nf and nextflow.config to decide whether it has found the NVD pipeline root. The two-file fingerprint was adding development friction without providing strong provenance: it ignored modules, subworkflows, bin scripts, schemas, and Python helpers, and development installs relaxed verification back to file-presence checks anyway. Root discovery now uses deterministic NVD-specific sentinel paths instead. An explicit NVD_PIPELINE_ROOT must point to a valid NVD checkout or fail fast, the installed/source package root is preferred over the current working directory, and cwd discovery is retained only as a container bind-mount fallback. This removes the generated _fingerprint.json artifact, the nvd-fingerprint entry point, and the local pre-commit hook that regenerated the fingerprint. Tests cover the new sentinel checks and guard against selecting an unrelated Nextflow project as the pipeline root.
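A rough sketch of sentinel-based root discovery with the precedence described above. The sentinel list and function names are assumptions for illustration; the actual sentinel paths used by the CLI are not listed in this commit message.

```python
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Hypothetical sentinels: files that should only coexist in an NVD checkout.
SENTINELS: Sequence[str] = ("main.nf", "nextflow.config", "workflows/nvd_main.nf")


def is_nvd_root(path: Path, sentinels: Sequence[str] = SENTINELS) -> bool:
    """True only if every sentinel path exists under the candidate root."""
    return all((path / rel).exists() for rel in sentinels)


def find_pipeline_root(package_root: Path, cwd: Path,
                       env: Mapping[str, str]) -> Optional[Path]:
    """Explicit env var first (fail fast), then package root, then cwd."""
    explicit = env.get("NVD_PIPELINE_ROOT")
    if explicit:
        root = Path(explicit)
        if not is_nvd_root(root):
            raise ValueError(f"NVD_PIPELINE_ROOT is not an NVD checkout: {root}")
        return root
    for candidate in (package_root, cwd):
        if is_nvd_root(candidate):
            return candidate
    return None
```

Requiring an NVD-specific path alongside main.nf is what keeps an unrelated Nextflow project from being mistaken for the pipeline root.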
The v3 image no longer clones and compiles the NCBI ncbi-vdb, sra-tools, and ngs-tools sources to provide STAT's aligns_to binary. STAT support is being removed from the v3 release, so carrying a bespoke source build in the image only slows builds and preserves a toolchain path we no longer intend to support. This also removes the image-only build tools that existed for that source build: apt no longer installs git, wget, gcc, g++, or cmake, and the pixi environment is no longer mutated with extra cxx-compiler, cmake, and make packages during the image build. The remaining apt packages are only curl for installing pixi and util-linux for the script wrapper used by pixi commands. As a Boy Scout cleanup, the duplicate final pixi cache-clean layer was dropped after removing the STAT build step. The image now installs the locked project environment and then proceeds directly to the small runtime cache permission fix.
The v3 parameter surface no longer exposes STAT database paths or STAT database versioning. The schema, Nextflow defaults, Pydantic model, Typer run command, preset registration, params template generation, and config validation now stop accepting stat_index, stat_dbss, stat_annotation, human_virus_taxlist, and stat_db_version. The replacement surface introduces direct virus enrichment inputs for deacon: virus_index, virus_index_url, virus_reference_fasta, and the associated k-mer/window/threshold controls. These mirror the host depletion shape while naming the resource by its user-facing purpose: a vertebrate-infecting virus k-mer index. This intentionally leaves parts of the Nextflow workflow broken until the STAT-to-deacon workflow internals are rewired in a follow-up commit. That temporary breakage is part of the v3 cleanup plan: first make the public contract honest, then refactor the implementation to match it. As Boy Scout cleanup, STAT was also removed from the database-version resolver path rather than preserving a dead public parameter path behind the CLI. Existing processed-sample STAT provenance remains in the state model for a later state-management cleanup.
The Slack notification tests were still importing helper functions that no longer exist in notify_slack.py. This caused full pytest collection to fail before the suite could run. The tests now target the current report-based notification helpers: build_message takes a RunReport or falls back when one is unavailable, report/highlight helpers are tested for graceful failure, and cross-run context formatting has direct coverage. Obsolete tests for removed byte-size and cumulative-stat helpers were deleted instead of preserving compatibility shims in production code. This is a Boy Scout cleanup for the test surface: the tests now describe the script that exists today rather than forcing dead helper APIs back into the implementation.
The 5-process STAT chain (two aligns_to passes + taxonomy filtering + seqkit extract) is replaced by a single deacon filter call on assembled FASTA contigs using the same virus index already built for read extraction. Also wires the new virus_index/virus_index_url/virus_reference_fasta params introduced in 96e112f into stat_blast_workflow.nf, completing the STAT database param removal.
Three wastewater metagenomics samples (illumina paired-end) staged
under tests/data/ with generic names. All files passed through deacon
host depletion (panhuman-1.k31w15.idx, abs=1, rel=0) before commit to
remove human reads:
[wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/water_plus_hits_R1.fastq.gz tests/data/water_plus_hits_R2.fastq.gz -o tests/data/water_plus_hits_R1.dep.fastq.gz -O tests/data/water_plus_hits_R2.dep.fastq.gz
Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
Loaded index (k=31, w=15) in 5.82s
Retained 83882/91986 sequences (91.190%), 12666182/13889886 bp (91.190%) in 5.87s. 2089382 seqs/s (315.5 Mbp/s)
[wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/water_R1.fastq.gz tests/data/water_R2.fastq.gz -o tests/data/water_R1.dep.fastq.gz -O tests/data/water_R2.dep.fastq.gz
Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
Loaded index (k=31, w=15) in 5.85s
Retained 73852/81954 sequences (90.114%), 11151652/12375054 bp (90.114%) in 5.89s. 2083567 seqs/s (314.6 Mbp/s)
[wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/hits_only_R1.fastq.gz tests/data/hits_only_R2.fastq.gz -o tests/data/hits_only_R1.dep.fastq.gz -O tests/data/hits_only_R2.dep.fastq.gz
Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
Loaded index (k=31, w=15) in 5.83s
Retained 10030/10032 sequences (99.980%), 1514530/1514832 bp (99.980%) in 5.84s. 2083567 seqs/s (314.6 Mbp/s)
[wkgardner@oconnor-ap nvd]$ mv tests/data/water_R1.dep.fastq.gz tests/data/water_R1.fastq.gz
mv tests/data/water_R2.dep.fastq.gz tests/data/water_R2.fastq.gz
mv tests/data/water_plus_hits_R1.dep.fastq.gz tests/data/water_plus_hits_R1.fastq.gz
mv tests/data/water_plus_hits_R2.dep.fastq.gz tests/data/water_plus_hits_R2.fastq.gz
mv tests/data/hits_only_R1.dep.fastq.gz tests/data/hits_only_R1.fastq.gz
mv tests/data/hits_only_R2.dep.fastq.gz tests/data/hits_only_R2.fastq.gz
[wkgardner@oconnor-ap nvd]$ seqkit stats tests/data/water_plus_hits_R1.fastq.gz
file format type num_seqs sum_len min_len avg_len max_len
tests/data/water_plus_hits_R1.fastq.gz FASTQ DNA 41,941 6,333,091 151 151 151
[wkgardner@oconnor-ap nvd]$ seqkit stats tests/data/water_plus_hits_R2.fastq.gz
file format type num_seqs sum_len min_len avg_len max_len
tests/data/water_plus_hits_R2.fastq.gz FASTQ DNA 41,941 6,333,091 151 151 151
Also adds assets/test_samplesheet.csv with relative paths to the test data.
The Python state layer now models only the main BLAST-based NVD workflow. Active state records keep BLAST database provenance and taxonomy provenance, while retired STAT/GOTTCHA2 database and upload types are no longer accepted or advertised by the state CLI. The state schema is bumped to version 3 and known v2 databases migrate in place after creating a timestamped backup. The migration preserves active BLAST rows, drops retired workflow rows from active tables, removes stat_db_version from processed_samples, and keeps the pre-migration database available in the backup rather than asking users to delete old SQLite state. This also removes STAT provenance arguments from Python-side BLAST registration/LabKey preparation helpers and updates the state tests to assert the v3 invariant: current state is BLAST-only, while v2-era data remains recoverable from the automatic backup.
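The backup-then-migrate pattern might be sketched as follows. This is not the project's migration code: it assumes the schema version is tracked in PRAGMA user_version, rebuilds processed_samples without preserving constraints or indexes, ignores WAL sidecar files, and omits the removal of retired workflow rows.

```python
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def migrate_v2_to_v3(db_path: Path) -> Path:
    """Back up a v2 state database, then drop stat_db_version in place."""
    conn = sqlite3.connect(str(db_path), isolation_level=None)
    try:
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        if version != 2:
            raise ValueError(f"expected schema v2, found v{version}")
        # Timestamped copy taken before any change, so v2 data stays recoverable.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        backup = db_path.with_name(f"{db_path.name}.v2.{stamp}.bak")
        shutil.copy2(db_path, backup)
        keep = [row[1] for row in conn.execute("PRAGMA table_info(processed_samples)")
                if row[1] != "stat_db_version"]
        cols = ", ".join(f'"{c}"' for c in keep)
        conn.execute("BEGIN")
        try:
            conn.execute(f"CREATE TABLE processed_samples_v3 AS "
                         f"SELECT {cols} FROM processed_samples")
            conn.execute("DROP TABLE processed_samples")
            conn.execute("ALTER TABLE processed_samples_v3 RENAME TO processed_samples")
            conn.execute("PRAGMA user_version = 3")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
    return backup
```

Rebuilding the table rather than using ALTER TABLE ... DROP COLUMN keeps the sketch working on SQLite builds older than 3.35.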
The top-level Nextflow entrypoint now includes and invokes NVD_MAIN from workflows/nvd_main.nf instead of STAT_BLAST_WORKFLOW from workflows/stat_blast_workflow.nf. This keeps the workflow name aligned with the v3 direction and avoids carrying the old STAT-branded workflow name forward. The result channel variable in main.nf was renamed alongside the workflow so the public entrypoint and immediate call site use the same NVD main-workflow terminology. Nearby comments and the terminal completion message were also updated to avoid preserving the old STAT+BLAST branding in active workflow source.
The rendered Nextflow config assertion no longer expects the removed public tools selector now that v3 has a single main workflow. The check still verifies the stable top-level parameters that should always be present. Slack notification tests now avoid lint noise while keeping explicitly fake Slack-shaped token literals in source for WebClient wiring coverage. The fake tokens are centralized with S105 suppressions and comments so security tooling has a local explanation rather than repeated token-looking literals.
The extract_taxa_spots.py and hits_to_report.py scripts were intentionally retired with the STAT contig classification chain, so their characterization tests now only preserved a dead public surface. Removing those tests reduces the maintenance burden instead of resurrecting obsolete scripts. The remaining BLAST taxonomy script tests now patch taxonomy._ensure_taxdump with a helper that accepts the current state_dir/taxonomy_dir call shape while still routing taxonomy.open() to the local fixture database.
Write and query hit observations under hits/schema=v3 so the v3 BLAST-only parquet store cannot be silently mixed with older v2 parquet files. This makes the break explicit while leaving old files on disk for manual recovery if needed. With the store versioned by path, remove the old contig_id/schema-evolution compatibility layer from the DuckDB view, parquet writers, compaction normalization, and tests. The v3 schema now uses sequence_id directly and fails loudly if required canonical columns are missing. Also remove remaining STAT database argument wiring from Nextflow hit/LabKey registration paths and refresh comments so REGISTER_HITS describes the parquet observation store rather than implying hits live in the SQLite state database.
Importing py_nvd.cli in the container image should not require a bundled Nextflow checkout. Make the module-level PIPELINE_ROOT best-effort so CLI imports succeed in image smoke tests, while commands that actually launch the pipeline still resolve the root lazily and fail with the existing actionable error if no root is available. Explicitly invalid NVD_PIPELINE_ROOT values continue to fail fast rather than falling back silently. The version command now reports an unavailable pipeline root instead of making import-time root discovery a hard dependency.
NVD 3.0 — Frontloaded Deacon Virus Extraction
Headline: 48 hours down to 9.5 hours per run
NVD 3.0 frontloads deacon-based human virus read extraction as the very first step in the pipeline, before interleaving and before any preprocessing. By filtering reads down to the ~1-3% that are virus-derived before touching anything else, every downstream step (dedup, trimming, quality filtering, SPADES assembly, BLAST classification) operates on a tiny subset of the original data.
Before (v2.5): Interleave (~1hr) → Preprocess full dataset (~4hr dedup + trim + filter) → STAT aligns_to virus extraction (~8-17hr) → SPADES → BLAST
After (v3.0): Deacon virus extraction on raw R1/R2 (~30min) → Preprocess virus subset (~minutes) → SPADES → BLAST
The deacon filter also replaces the separate INTERLEAVE_PAIRS step: for paired reads, deacon takes R1/R2 directly and outputs interleaved FASTQ in a single pass. The COUNT_READS process (~2hr) is eliminated by extracting read counts from deacon's --summary JSON output.
Validated across 35+ samples from runs 32180, 32188, 32212, and 32251 with 100% sensitivity compared to STAT and comparable BLAST classifications.
Breaking Changes
This is a major version bump (v2.5 → v3.0) per semantic versioning:
- --tools parameter removed. NVD always runs the STAT+BLAST workflow; there is no tool selection. Existing YAML files with tools: will need to remove that field.
- PREPROCESS_READS workflow is deleted. Preprocessing steps (dedup, trim, host scrub, filter, repair) are inlined into STAT_BLAST_WORKFLOW with individual param guards.
Summary of Changes
New capabilities:
- stat_to_deacon.rs converts STAT .dbss k-mer databases to deacon .idx indexes
Removed:
- PREPROCESS_READS workflow (inlined)
- HOST_DEPLETION subworkflow (inlined)
- --tools parameter and isToolSelected() logic
- COUNT_READS process
- INTERLEAVE_PAIRS/MERGE_PAIRS from GATHER_READS
Simplified:
- main.nf reduced to ~15 lines
- GATHER_READS does file resolution only
Test Plan