
V3.0.0 rc #21

Open
wkgardner wants to merge 56 commits into main from v3.0.0-rc

@wkgardner
Collaborator

NVD 3.0 — Frontloaded Deacon Virus Extraction

Headline: 48 hours down to 9.5 hours per run

NVD 3.0 frontloads deacon-based human virus read extraction as the very first step in the pipeline, before interleaving and before any preprocessing. Because reads are filtered down to the ~1-3% that are virus-derived before anything else runs, every downstream step (dedup, trimming, quality filtering, SPAdes assembly, BLAST classification) operates on a small subset of the original data.

Before (v2.5): Interleave (~1hr) → Preprocess full dataset (~4hr dedup + trim + filter) → STAT aligns_to virus extraction (~8-17hr) → SPAdes → BLAST

After (v3.0): Deacon virus extraction on raw R1/R2 (~30min) → Preprocess virus subset (~minutes) → SPAdes → BLAST

The deacon filter also replaces the separate INTERLEAVE_PAIRS step — for paired reads, deacon takes R1/R2 directly and outputs interleaved FASTQ in a single pass. The COUNT_READS process (~2hr) is eliminated by extracting read counts from deacon's --summary JSON output.

Validated across 35+ samples from runs 32180, 32188, 32212, and 32251 with 100% sensitivity compared to STAT and comparable BLAST classifications.

Breaking Changes

This is a major version bump (v2.5 → v3.0) per semantic versioning:

  • GOTTCHA2 removed. The workflow, its LabKey integration, and all associated modules/scripts are deleted.
  • Clumpify post-processing removed. The SRA re-compression workflow is removed. It will either become its own standalone pipeline or be integrated into the orchestrator system.
  • --tools parameter removed. NVD always runs the STAT+BLAST workflow, so there is no tool selection. Existing params YAML files that set tools: must drop that field.
  • Preprocessing inlined. The separate PREPROCESS_READS workflow is deleted. Preprocessing steps (dedup, trim, host scrub, filter, repair) are inlined into STAT_BLAST_WORKFLOW with individual param guards.

Summary of Changes

New capabilities:

  • Deacon virus extraction frontloaded as step 1 (~30min vs ~8-17hr STAT)
  • Deacon handles R1/R2 → interleaved in one pass (eliminates INTERLEAVE_PAIRS)
  • Read counts extracted from deacon --summary JSON (eliminates COUNT_READS)
  • stat_to_deacon.rs converts STAT .dbss k-mer databases to deacon .idx indexes
  • Supports both paired Illumina and single-end ONT reads

Removed:

  • GOTTCHA2 workflow and all associated files (9 files, ~2300 lines)
  • Clumpify post-processing workflow
  • PREPROCESS_READS workflow (inlined)
  • HOST_DEPLETION subworkflow (inlined)
  • --tools parameter and isToolSelected() logic
  • COUNT_READS process
  • INTERLEAVE_PAIRS / MERGE_PAIRS from GATHER_READS
  • GOTTCHA2 params from nextflow.config, CLI, models, and JSON schema
  • Old v2.4.0 schema

Simplified:

  • main.nf reduced to ~15 lines
  • GATHER_READS does file resolution only
  • No forked channels, gating logic, or completion tokens
  • JSON schema bumped to v3.0.0

Test Plan

  • Paired Illumina samples — full pipeline through BLAST
  • ONT single-end samples — deacon single-end path
  • Mixed Illumina + ONT samplesheet
  • 7-sample subset of run 32251 compared against v2.5 results
  • Full 32260 production run with Tower monitoring
  • Schema validation passes
  • py_nvd unit tests pass
  • LabKey upload verification

wkgardner added 30 commits May 15, 2026 10:15
This feature will help reduce the run time of
EXTRACT_HUMAN_VIRUS_READS by replacing aligns_to with deacon filter. For
deacon to build a database that behaves like STAT's exact k-mer
matching, you must use a window size of 1.
Meat and potatoes processes to replace EXTRACT_HUMAN_VIRUS_READS
DEACON_BUILD_INDEX_FROM_STAT_K_MERS was incorrectly wired as a
per-sample process with a 9-element tuple input containing reads,
reference files, and an existing deacon index. The index build only
needs the three STAT reference files (dbss, annotation, taxids) and
produces identical output for every sample, so it should run once.

Rewrites:
- DEACON_BUILD_INDEX_FROM_STAT_K_MERS: takes 3 separate path inputs
  (stat_dbss, stat_annotation, human_virus_taxlist), runs once per
  pipeline invocation, outputs human_viruses.k31w1.idx. References
  stat_to_deacon.rs via ${projectDir}/bin/ for container compatibility.
- DEACON_FILTER_HUMAN_VIRUS_READS: simplified from 9-element tuple to
  5-element tuple (sample_id, platform, read_structure, reads, index).
  Accepts interleaved input directly — deacon handles this natively,
  eliminating the need for a separate split step. Fixed missing quotes
  on path() output declaration and removed shell comment inside a line
  continuation that would have broken the command.
This process was added to split interleaved reads into R1/R2 before
deacon filtering. Deacon handles interleaved FASTQ input natively, so
the split step is unnecessary overhead — it adds an extra process, extra
disk I/O, and intermediate files without benefit. The rewritten
DEACON_FILTER_HUMAN_VIRUS_READS accepts interleaved reads directly.
EXTRACT_HUMAN_VIRUS_READS used STAT aligns_to + seqkit grep to identify
human virus reads, taking 8-17 hours per sample. This replaces it with
two deacon-based processes validated across 35 samples at 100%
sensitivity and 60-80x speedup (~8-15 minutes per sample).

The subworkflow now:
1. Builds a deacon index from STAT k-mers (once, not per-sample) via
   DEACON_BUILD_INDEX_FROM_STAT_K_MERS
2. Filters reads per-sample via DEACON_FILTER_HUMAN_VIRUS_READS using
   the shared index, combined with each sample's reads via .combine()

The output tuple shape is unchanged — tuple(sample_id, platform,
read_structure, fastq) — so RUN_SPADES, MAP_READS_TO_CONTIGS, and the
viral_reads emit all work without modification.
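
The "build once, combine with every sample" shape described above can be sketched outside Nextflow. This is only a Python analogy for what .combine() does when one side holds a single item; the sample values and the combine_with_index helper are hypothetical, and only the tuple order and the index filename come from the commit messages.

```python
from itertools import product


def combine_with_index(samples, index_path):
    """Attach the single shared index path to every per-sample tuple.

    samples: iterable of (sample_id, platform, read_structure, reads).
    Mirrors the cartesian product that a channel combine performs when
    the right-hand side holds exactly one item.
    """
    return [sample + (idx,) for sample, idx in product(samples, [index_path])]


# Hypothetical samples; the index filename is the one named in the commit.
samples = [
    ("s1", "illumina", "interleaved", "s1.fastq.gz"),
    ("s2", "ont", "single", "s2.fastq.gz"),
]
combined = combine_with_index(samples, "human_viruses.k31w1.idx")
```

Each element of combined is a 5-element tuple ending in the same index path, which is why the index build only has to run once.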

Also updates conf/results.config to publish DEACON_FILTER_HUMAN_VIRUS_READS
output to the same human_virus_reads directory.
rust-script is not available for aarch.
…US_READS

Deacon's pair-aware mode requires interleaved input via stdin (- -)
to count k-mer hits from both mates jointly. When given a single file
positionally, deacon treats it as single-end. The process now branches
on read_structure to use the correct invocation:

- Interleaved (Illumina): zcat | deacon filter ... - -
  Pair-aware k-mer counting where hits from either mate contribute
  to the filtering threshold for the read pair.

- Single-end (ONT): deacon filter ... ${reads}
  Standard single-read filtering with no pairing logic.

This ensures correct sensitivity for both sequencing platforms when
virus extraction is frontloaded before preprocessing.
Virus read extraction now happens upstream in STAT_BLAST_WORKFLOW via
DEACON_FILTER_HUMAN_VIRUS_READS, so PREPROCESS_CONTIGS no longer needs
to perform extraction. Removes deacon process imports and the three
STAT reference file take channels (ch_stat_dbss, ch_stat_annotation,
ch_human_virus_taxlist).

The subworkflow now takes a single channel of already-filtered and
preprocessed reads and runs: RUN_SPADES -> MASK_LOW_COMPLEXITY ->
FILTER_SHORT_CONTIGS. The viral_reads emit passes through the incoming
reads directly since they ARE the viral reads.
Moves virus read extraction to the very start of the STAT_BLAST
pipeline, running immediately after interleaving and before any
preprocessing. This reduces the dataset from full metagenomic reads
(~100% of data) to virus-only reads (~1-3%) before dedup, trimming,
and filtering — saving ~5 hours per sample.

The new pipeline order within STAT_BLAST_WORKFLOW:
1. DEACON_BUILD_INDEX_FROM_STAT_K_MERS (once, from STAT .dbss)
2. DEACON_FILTER_HUMAN_VIRUS_READS (per sample, ~30 min)
3. PREPROCESS_VIRUS_READS (aliased PREPROCESS_READS on tiny subset)
4. PREPROCESS_CONTIGS (SPAdes + mask + filter, no extraction)
5. EXTRACT_HUMAN_VIRUSES + BLAST classification

The aliased PREPROCESS_READS import (as PREPROCESS_VIRUS_READS) allows
main.nf to call its own PREPROCESS_READS instance for the GOTTCHA2
path without Nextflow DSL2 invocation conflicts.
STAT_BLAST_WORKFLOW now receives raw gathered reads instead of
preprocessed reads, since it handles its own virus extraction and
preprocessing internally on the small virus subset.

GOTTCHA2 preprocessing is gated behind STAT_BLAST completion via a
combine/map signal pattern. This frees cluster resources from the
fast STAT_BLAST path (~30 min virus extraction + fast preprocessing)
before the expensive full-dataset preprocessing starts for GOTTCHA2.

Pipeline execution order:
1. GATHER_READS (interleave) — shared by all paths
2. STAT_BLAST_WORKFLOW (immediate, virus subset) — deacon → preprocess → SPAdes → BLAST
3. PREPROCESS_READS (waits for STAT_BLAST) — full dataset for GOTTCHA2
4. GOTTCHA2_WORKFLOW (after preprocessing completes)
5. CLUMPIFY_WORKFLOW (waits for both STAT_BLAST and GOTTCHA2)
SPAdes assembles virus-only contigs from a small filtered read subset,
not full metagenomic datasets. The ludicrous tier (450GB+ RAM) is
unnecessary. The high label (20 CPUs, 64GB RAM) is sufficient for
de novo assembly of extracted virus reads.
Deacon now takes R1/R2 as separate files for paired reads and outputs
interleaved FASTQ in a single pass, combining virus filtering and
interleaving into one step. This eliminates the ~1 hour INTERLEAVE_PAIRS
bottleneck for the STAT_BLAST path.

Input is now tuple(sample_id, platform, reads, reads2, deacon_idx)
where reads2 is a sentinel file "NO_R2" for single-end (ONT) samples.
The process determines read_structure at runtime by checking reads2.name
and branches: paired passes R1 R2 as positional args to deacon, single
passes the file directly. Output tuple is the standard 4-element shape
(sample_id, platform, read_structure, fastq) for PREPROCESS_READS.
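
A minimal Python sketch of the runtime branch described above, assuming the sentinel is detected purely by file name. The function names and the "interleaved"/"single" labels are illustrative assumptions; the real process is a Nextflow script block, and threshold and output flags are deliberately left out because the commit message does not list them.

```python
from pathlib import Path

SENTINEL_NAME = "NO_R2"  # sentinel file name taken from the commit message


def plan_deacon_inputs(reads, reads2):
    """Return (read_structure, positional_inputs) for one sample."""
    if Path(reads2).name == SENTINEL_NAME:
        # Single-end (ONT): only the one FASTQ is passed to deacon.
        return "single", [str(reads)]
    # Paired (Illumina): R1 and R2 are passed as two positional arguments;
    # deacon's output for the pair is interleaved.
    return "interleaved", [str(reads), str(reads2)]


def deacon_filter_argv(index, reads, reads2):
    """Build a partial argv: subcommand, index, then positional reads."""
    structure, inputs = plan_deacon_inputs(reads, reads2)
    return structure, ["deacon", "filter", str(index), *inputs]
```

For example, deacon_filter_argv("idx", "a_R1.fastq.gz", "a_R2.fastq.gz") yields the paired form, while passing a path whose name is NO_R2 as the third argument yields the single-end form.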
…AIRS

GATHER_READS now emits two channels:
- ch_gathered_reads: interleaved reads (for GOTTCHA2 and CLUMPIFY)
- ch_pre_interleave: raw R1/R2 tuples (for STAT_BLAST's deacon filter)

INTERLEAVE_PAIRS only runs when gottcha2 or clumpify is selected. When
only stat_blast is selected, interleaving is skipped entirely — deacon
handles R1/R2 directly and outputs interleaved as a byproduct. This
avoids wasting ~1 hour of compute on interleaving that the STAT path
does not consume.

Uses NvdUtils.isToolSelected() to gate the interleave step. The
pre-interleave channel has mixed tuple sizes: paired reads emit
(id, platform, R1, R2) while singles emit (id, platform, fastq).
Downstream normalization happens in STAT_BLAST_WORKFLOW.
COUNT_READS now accepts the same R1/R2-or-sentinel tuple pattern as
DEACON_FILTER_HUMAN_VIRUS_READS: (sample_id, platform, reads, reads2)
where reads2 is "NO_R2" for single-end samples.

For paired samples, both R1 and R2 are counted and summed to produce
the total read count — matching the result previously obtained from
counting a single interleaved file. For single-end, counts the one
file as before. This maintains backwards-compatible read counts for
LabKey reporting.
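
The summing rule above can be illustrated in Python. This is a sketch, not the COUNT_READS implementation: it assumes gzipped FASTQ with exactly four lines per record, and the function names are hypothetical.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def total_reads(reads, reads2):
    """Sum R1 and R2 for paired samples; count one file for single-end."""
    total = count_fastq_reads(reads)
    if Path(reads2).name != "NO_R2":  # the sentinel marks single-end samples
        total += count_fastq_reads(reads2)
    return total
```

Summing both mates gives the same number as counting every record in one interleaved file, which is what keeps the reported totals comparable with earlier runs.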
STAT_BLAST_WORKFLOW now takes pre-interleave tuples directly from
GATHER_READS instead of interleaved reads. Mixed-size tuples are
normalized with a sentinel file (NO_R2) so downstream processes get
a fixed 4-element tuple regardless of platform:
  Paired: (id, platform, R1, R2)
  Single: (id, platform, fastq, NO_R2)
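
The normalization shown above amounts to padding the shorter tuple. A small Python sketch, assuming the sentinel is represented by its name alone (in the workflow it is a file) and using a hypothetical helper name:

```python
def normalize_reads_tuple(item, sentinel="NO_R2"):
    """Pad single-end tuples so every sample is (id, platform, reads, reads2)."""
    if len(item) == 4:
        # Paired: already (id, platform, R1, R2).
        return tuple(item)
    if len(item) == 3:
        # Single-end: append the sentinel in the R2 slot.
        sample_id, platform, fastq = item
        return (sample_id, platform, fastq, sentinel)
    raise ValueError(f"unexpected tuple size {len(item)}: {item!r}")
```

Rejecting any other size keeps a malformed samplesheet row from silently flowing into later steps.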

COUNT_READS receives the full tuple and counts both R1+R2 for paired
samples. The deacon filter step now runs before any interleaving or
preprocessing, saving ~5 hours per sample (1hr interleave + 4hr dedup
on full data).
STAT_BLAST_WORKFLOW receives GATHER_READS.out.ch_pre_interleave (raw
R1/R2) so deacon can filter and interleave in one step. GOTTCHA2 and
CLUMPIFY receive GATHER_READS.out.ch_gathered_reads (interleaved) as
before. Named emit access (GATHER_READS.out.channel_name) replaces
the previous unnamed single-emit pattern.
Removes GOTTCHA2, clumpify, PREPROCESS_READS imports and all gating
logic. Pipeline is now: samplesheet -> GATHER_READS ->
STAT_BLAST_WORKFLOW. No tool selection, no forked channels, no
completion tokens.
Removes INTERLEAVE_PAIRS, MERGE_PAIRS, needs_interleave conditional,
ch_gathered_reads, and dual emit channels. Emits raw R1/R2 tuples
only. Deacon handles interleaving as a byproduct of virus filtering.
…ction

STAT_BLAST_WORKFLOW owns its full pipeline: deacon extraction ->
inlined preprocessing -> SPAdes -> BLAST. Preprocessing steps have
direct param guards (params.dedup, params.trim_adapters, etc.).

Removes isToolSelected gating, ch_blast_enabled signal channel,
blast_selected conditionals, and the aliased PREPROCESS_READS import.
Also removes when: params.tools guard from FETCH_FASTQ in sratools.nf.
Removes isToolSelected(), BLAST_ALIASES, LABKEY_GOTTCHA2_PARAMS,
and validateLabkeyGottcha2(). Keeps validateLabkeyBlast() and the
shared LabKey validation helper.
Removes tools, gottcha2_db, gottcha2_db_version, min_gottcha_reads,
labkey_gottcha_* params, sra_human_db, and human_read_scrub.
Removes gottcha2_results, clumpify_results path params and their
publishDir entries (CLUMP_READS, SCRUB_HUMAN_READS,
LABKEY_UPLOAD_GOTTCHA2_FASTA, LABKEY_UPLOAD_GOTTCHA2_FULL).
Removes tools, gottcha2_db, min_gottcha_reads, and labkey_gottcha_*
from NvdParams model and CLI options. Updates DbType, UploadType,
and HitSource literals to remove gottcha2 variants. Cleans state.py
database resolution and schema.sql CHECK constraints.
Removes tools, gottcha2_db, gottcha2_db_version, min_gottcha_reads,
labkey_gottcha_*, sra_human_db, human_read_scrub properties.
Renames schema file to v3.0.0 to match the major version bump.
Removes the GOTTCHA2 workflow, its LabKey integration, hit registration,
and associated bin scripts. GOTTCHA2 will be spun off into its own
standalone Nextflow workflow.
Removes clumpify post-processing workflow, CLUMP_READS from bbmap.nf,
and three dead processes from stat.nf (EXTRACT_HUMAN_VIRUS_READS,
SCRUB_HOST_READS, SCRUB_HUMAN_READS) replaced by deacon equivalents.
…ion subworkflow

v2.4.0 schema is obsolete. preprocess_reads.nf logic is inlined into
stat_blast_workflow.nf. host_depletion.nf index resolution logic is
inlined where needed.
The tools field was removed from NvdParams but the @field_validator
for it remained, causing a pydantic PydanticUserError on import.
wkgardner and others added 21 commits May 15, 2026 10:26
DEACON_BUILD_INDEX is not used anywhere and is now removed.
With the update to v3.0, we hope that no process will require more than
the high resource label.
Adds --summary flag to produce a structured JSON with seqs_in,
seqs_out, bp_in, bp_out, and timing stats. Emits the JSON as a
named output (stats) alongside the filtered reads (reads). The
seqs_in field provides total input read counts, enabling downstream
extraction without a separate counting step.
Removes the COUNT_READS process call (~2 hours decompressing full
R1+R2) and extracts total read counts from the deacon filter summary
JSON via groovy.json.JsonSlurper. The seqs_in field provides the
same tuple(sample_id, total_reads) shape that BUNDLE_BLAST_FOR_LABKEY
expects, at zero additional compute cost.
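
The workflow does this in Groovy with JsonSlurper; the same extraction is sketched here in Python for illustration. Only the field names (seqs_in, seqs_out, bp_in, bp_out) come from the commit message. The payload values are made up, and a real summary contains additional fields such as the timing stats.

```python
import json


def total_reads_from_summary(sample_id, summary_text):
    """Return (sample_id, total_reads) using the summary's seqs_in field."""
    summary = json.loads(summary_text)
    return (sample_id, int(summary["seqs_in"]))


# Illustrative payload only; the numbers are invented for the example.
example = '{"seqs_in": 1000, "seqs_out": 20, "bp_in": 151000, "bp_out": 3020}'
```

Calling total_reads_from_summary("s1", example) returns ("s1", 1000), the same two-element shape the LabKey bundling step expects.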
Appends a total_reads column to the final merged BLAST TSV so read
counts are always present in published results regardless of whether
LabKey is enabled. Uses awk to add the column after the LCA annotation
step. Takes tuple(sample_id, blast_tsv, total_reads) and outputs the
same filename with the new column.
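
The column append is done with awk in the pipeline. A rough Python equivalent, assuming the TSV has a single header row (the helper name is hypothetical):

```python
def append_total_reads(lines, total_reads):
    """Yield TSV lines with a trailing total_reads column.

    The first line is treated as the header; every other line gets the
    same per-sample read count.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        value = "total_reads" if i == 0 else str(total_reads)
        yield f"{line}\t{value}"
```

Because the value is constant per sample, the function needs only the count and never has to parse the existing columns.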
Inserts the read count annotation step between CLASSIFY_WITH_BLASTN
and downstream consumers (REGISTER_HITS, BUNDLE_BLAST_FOR_LABKEY).
All downstream references now use ADD_READ_COUNTS_TO_BLAST.out
instead of CLASSIFY_WITH_BLASTN.out.merged_results. Adds publishDir
entry so the final TSV with total_reads lands in blast_merged.
The input and output filenames were both ${sample_id}_blast.merged_with_lca.tsv.
Nextflow stages the input as a symlink with that name, and the shell
redirect truncates it to 0 bytes before awk can read it. Renames output
to ${sample_id}_blast.final.tsv to avoid the collision.
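
The truncation described above is not specific to awk: any writer that opens the destination before the reader has consumed the source hits it. A small Python analogy (hypothetical function names), where open(path, "w") plays the role of the shell redirect:

```python
from pathlib import Path


def rewrite_in_place_naively(path):
    """Open the file for writing first, as a shell '>' redirect does."""
    with open(path, "w") as out:       # truncates the file immediately
        data = Path(path).read_text()  # so the input is already empty here
        out.write(data)
    return data


def rewrite_to_new_file(src, dst):
    """Read the source fully, then write to a different filename."""
    data = Path(src).read_text()
    Path(dst).write_text(data)
    return data
```

The first function always returns an empty string and leaves a zero-byte file; the second preserves the content, which is the effect of renaming the output.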
This commit makes the v3 configuration contract match the pipeline we actually intend to ship. Earlier versions accumulated parameters for GOTTCHA2, STAT-based host scrubbing, pair merging, and other optional branches that made the command surface look broader than the now-focused STAT+BLAST workflow. For a major release, keeping those knobs around would be worse than breaking compatibility: users would reasonably assume the parameters still controlled supported behavior.

The public parameter surface is now synchronized across nextflow.config, the v3 JSON schema, the Pydantic NvdParams model, the Typer CLI, preset registration, validation helpers, and generated params templates. Removed parameters include merge_pairs, scrub_host_reads, sra_human_db, human_read_scrub, and repair_pairs. The schema metadata and docs now point at nvd-params.v3.0.0, and the generated YAML template no longer suggests removed tools or GOTTCHA-era settings.

Deduplication remains central to the pipeline, but its semantics are simpler. The umbrella --dedup flag enables both sequence-based clumpify deduplication and positional samtools markdup deduplication, while --dedup-seq and --dedup-pos allow users to enable either mode independently. There are no negative counterpart flags for these options, avoiding contradictory states such as enabling the umbrella dedup flag while disabling one of its intended components.
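
The flag semantics in the paragraph above reduce to two boolean ORs. A sketch with a hypothetical function name, taking the three flags as plain booleans:

```python
def resolve_dedup(dedup=False, dedup_seq=False, dedup_pos=False):
    """Return (run_sequence_dedup, run_positional_dedup).

    The umbrella flag turns on both modes; each specific flag turns on
    only its own mode. There is no way to express "umbrella on, one
    component off", so contradictory states cannot arise.
    """
    return (dedup or dedup_seq, dedup or dedup_pos)
```

For example, resolve_dedup(dedup=True) gives (True, True), while resolve_dedup(dedup_pos=True) gives (False, True).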

Host depletion is also reframed around what users care about rather than the implementation detail. The old deacon_* names become semantic host_* parameters, and host depletion is off by default. Users opt in by providing a local host index, a host index URL, or a custom contaminant FASTA. Custom FASTA inputs now build an index with the configured k-mer and window sizes, and single-index resolution uses a symlink instead of copying potentially large index files.

The release metadata moves to 3.0.0 as part of the same contract change: pyproject.toml, the default container image, schema URLs, the pipeline fingerprint, and the uv/pixi locks are refreshed together. The GOTTCHA2 Conda dependency is removed so the environment no longer pays for a feature that v3 no longer exposes.
The CLI no longer hashes main.nf and nextflow.config to decide whether it has found the NVD pipeline root. The two-file fingerprint was adding development friction without providing strong provenance: it ignored modules, subworkflows, bin scripts, schemas, and Python helpers, and development installs relaxed verification back to file-presence checks anyway.

Root discovery now uses deterministic NVD-specific sentinel paths instead. An explicit NVD_PIPELINE_ROOT must point to a valid NVD checkout or fail fast, the installed/source package root is preferred over the current working directory, and cwd discovery is retained only as a container bind-mount fallback.
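
A minimal sketch of that lookup order, under stated assumptions: the sentinel list below is hypothetical (the commit does not name the actual sentinel paths), as are the function names and the choice of RuntimeError.

```python
import os
from pathlib import Path

# Hypothetical sentinels; the real list lives in the py_nvd source.
SENTINELS = ("main.nf", "workflows/nvd_main.nf")


def is_nvd_root(path):
    """True only if every sentinel path exists under the candidate root."""
    return all((Path(path) / rel).exists() for rel in SENTINELS)


def find_pipeline_root(package_root, cwd, env=None):
    """Resolve the root: explicit env var, then package root, then cwd."""
    env = os.environ if env is None else env
    explicit = env.get("NVD_PIPELINE_ROOT")
    if explicit:
        # An explicit value must be valid; never fall back silently.
        if not is_nvd_root(explicit):
            raise RuntimeError(f"NVD_PIPELINE_ROOT is not an NVD checkout: {explicit}")
        return Path(explicit)
    for candidate in (package_root, cwd):  # package root wins over cwd
        if is_nvd_root(candidate):
            return Path(candidate)
    return None
```

Requiring a sentinel beyond main.nf is what keeps an unrelated Nextflow project in the working directory from being mistaken for the pipeline root.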

This removes the generated _fingerprint.json artifact, the nvd-fingerprint entry point, and the local pre-commit hook that regenerated the fingerprint. Tests cover the new sentinel checks and guard against selecting an unrelated Nextflow project as the pipeline root.
The v3 image no longer clones and compiles the NCBI ncbi-vdb, sra-tools, and ngs-tools sources to provide STAT's aligns_to binary. STAT support is being removed from the v3 release, so carrying a bespoke source build in the image only slows builds and preserves a toolchain path we no longer intend to support.

This also removes the image-only build tools that existed for that source build: apt no longer installs git, wget, gcc, g++, or cmake, and the pixi environment is no longer mutated with extra cxx-compiler, cmake, and make packages during the image build. The remaining apt packages are only curl for installing pixi and util-linux for the script wrapper used by pixi commands.

As a Boy Scout cleanup, the duplicate final pixi cache-clean layer was dropped after removing the STAT build step. The image now installs the locked project environment and then proceeds directly to the small runtime cache permission fix.
The v3 parameter surface no longer exposes STAT database paths or STAT database versioning. The schema, Nextflow defaults, Pydantic model, Typer run command, preset registration, params template generation, and config validation now stop accepting stat_index, stat_dbss, stat_annotation, human_virus_taxlist, and stat_db_version.

The replacement surface introduces direct virus enrichment inputs for deacon: virus_index, virus_index_url, virus_reference_fasta, and the associated k-mer/window/threshold controls. These mirror the host depletion shape while naming the resource by its user-facing purpose: a vertebrate-infecting virus k-mer index.
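
One way to picture how the three inputs might be resolved is sketched below. The parameter names come from the commit message, but the precedence order (local index, then URL, then FASTA), the action labels, and the function name are assumptions made for illustration.

```python
# Assumed precedence: an existing local index is cheapest, a download is
# next, and building from FASTA is the most expensive option.
_SOURCES = (
    ("virus_index", "use_local_index"),
    ("virus_index_url", "download_index"),
    ("virus_reference_fasta", "build_index_from_fasta"),
)


def resolve_virus_index_source(params):
    """Return (action, value) for the first configured input, else None."""
    for key, action in _SOURCES:
        value = params.get(key)
        if value:
            return action, value
    return None  # nothing configured; the caller decides how to react
```

With {"virus_index_url": "https://example.org/v.idx"} as input (a placeholder URL), the function returns ("download_index", "https://example.org/v.idx").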

This intentionally leaves parts of the Nextflow workflow broken until the STAT-to-deacon workflow internals are rewired in a follow-up commit. That temporary breakage is part of the v3 cleanup plan: first make the public contract honest, then refactor the implementation to match it.

As Boy Scout cleanup, STAT was also removed from the database-version resolver path rather than preserving a dead public parameter path behind the CLI. Existing processed-sample STAT provenance remains in the state model for a later state-management cleanup.
The Slack notification tests were still importing helper functions that no longer exist in notify_slack.py. This caused full pytest collection to fail before the suite could run.

The tests now target the current report-based notification helpers: build_message takes a RunReport or falls back when one is unavailable, report/highlight helpers are tested for graceful failure, and cross-run context formatting has direct coverage. Obsolete tests for removed byte-size and cumulative-stat helpers were deleted instead of preserving compatibility shims in production code.

This is a Boy Scout cleanup for the test surface: the tests now describe the script that exists today rather than forcing dead helper APIs back into the implementation.
The 5-process STAT chain (two aligns_to passes + taxonomy filtering + seqkit extract) is replaced by a single deacon filter call on assembled FASTA contigs using the same virus index already built for read extraction.

Also wires the new virus_index/virus_index_url/virus_reference_fasta params introduced in 96e112f into stat_blast_workflow.nf, completing the STAT database param removal.
Three wastewater metagenomics samples (Illumina paired-end) staged
under tests/data/ with generic names. All files were passed through deacon
host depletion (panhuman-1.k31w15.idx, abs=1, rel=0) before commit to
remove human reads:

    [wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/water_plus_hits_R1.fastq.gz tests/data/water_plus_hits_R2.fastq.gz -o tests/data/water_plus_hits_R1.dep.fastq.gz -O tests/data/water_plus_hits_R2.dep.fastq.gz
    Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
    Loaded index (k=31, w=15) in 5.82s
    Retained 83882/91986 sequences (91.190%), 12666182/13889886 bp (91.190%) in 5.87s. 2089382 seqs/s (315.5 Mbp/s)
    [wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/water_R1.fastq.gz tests/data/water_R2.fastq.gz -o tests/data/water_R1.dep.fastq.gz -O tests/data/water_R2.dep.fastq.gz
    Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
    Loaded index (k=31, w=15) in 5.85s
    Retained 73852/81954 sequences (90.114%), 11151652/12375054 bp (90.114%) in 5.89s. 2083567 seqs/s (314.6 Mbp/s)
    [wkgardner@oconnor-ap nvd]$ pixi run deacon filter --deplete -a 1 -r 0 panhuman-1.k31w15.idx tests/data/hits_only_R1.fastq.gz tests/data/hits_only_R2.fastq.gz -o tests/data/hits_only_R1.dep.fastq.gz -O tests/data/hits_only_R2.dep.fastq.gz
    Deacon v0.13.2; mode: deplete; input: paired; options: abs_threshold=1, rel_threshold=0, threads=8(4f+4c)
    Loaded index (k=31, w=15) in 5.83s
    Retained 10030/10032 sequences (99.980%), 1514530/1514832 bp (99.980%) in 5.84s. 2083567 seqs/s (314.6 Mbp/s)
    [wkgardner@oconnor-ap nvd]$ mv tests/data/water_R1.dep.fastq.gz tests/data/water_R1.fastq.gz
    mv tests/data/water_R2.dep.fastq.gz tests/data/water_R2.fastq.gz
    mv tests/data/water_plus_hits_R1.dep.fastq.gz tests/data/water_plus_hits_R1.fastq.gz
    mv tests/data/water_plus_hits_R2.dep.fastq.gz tests/data/water_plus_hits_R2.fastq.gz
    mv tests/data/hits_only_R1.dep.fastq.gz tests/data/hits_only_R1.fastq.gz
    mv tests/data/hits_only_R2.dep.fastq.gz tests/data/hits_only_R2.fastq.gz
    [wkgardner@oconnor-ap nvd]$ seqkit stats tests/data/water_plus_hits_R1.fastq.gz
    file                                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
    tests/data/water_plus_hits_R1.fastq.gz  FASTQ   DNA     41,941  6,333,091      151      151      151
    [wkgardner@oconnor-ap nvd]$ seqkit stats tests/data/water_plus_hits_R2.fastq.gz
    file                                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
    tests/data/water_plus_hits_R2.fastq.gz  FASTQ   DNA     41,941  6,333,091      151      151      151

Also adds assets/test_samplesheet.csv with relative paths to the test data.
The Python state layer now models only the main BLAST-based NVD workflow. Active state records keep BLAST database provenance and taxonomy provenance, while retired STAT/GOTTCHA2 database and upload types are no longer accepted or advertised by the state CLI.

The state schema is bumped to version 3 and known v2 databases migrate in place after creating a timestamped backup. The migration preserves active BLAST rows, drops retired workflow rows from active tables, removes stat_db_version from processed_samples, and keeps the pre-migration database available in the backup rather than asking users to delete old SQLite state.
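
The backup-then-migrate pattern can be sketched with the standard library. Everything schema-specific here is an assumption: the databases table, its db_type column, the retired type spellings, and the use of PRAGMA user_version to track the schema version are illustrative stand-ins, and the removal of stat_db_version from processed_samples is omitted.

```python
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

RETIRED_DB_TYPES = ("stat", "gottcha2")  # assumed spellings


def migrate_v2_to_v3(db_path):
    """Back up a v2 state database, then migrate it in place to v3.

    Returns the backup path, or None if the database is not at version 2.
    """
    db_path = Path(db_path)
    conn = sqlite3.connect(db_path)
    try:
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        if version != 2:
            return None  # only known v2 databases are migrated
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        backup = db_path.with_name(f"{db_path.name}.v2-backup-{stamp}")
        shutil.copy2(db_path, backup)  # pre-migration copy stays recoverable
        with conn:  # commit on success, roll back on error
            conn.execute(
                "DELETE FROM databases WHERE db_type IN (?, ?)", RETIRED_DB_TYPES
            )
            conn.execute("PRAGMA user_version = 3")
        return backup
    finally:
        conn.close()
```

Taking the copy before any write means a failed migration leaves both the original rows and the backup intact.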

This also removes STAT provenance arguments from Python-side BLAST registration/LabKey preparation helpers and updates the state tests to assert the v3 invariant: current state is BLAST-only, while v2-era data remains recoverable from the automatic backup.
The top-level Nextflow entrypoint now includes and invokes NVD_MAIN from workflows/nvd_main.nf instead of STAT_BLAST_WORKFLOW from workflows/stat_blast_workflow.nf. This keeps the workflow name aligned with the v3 direction and avoids carrying the old STAT-branded workflow name forward.

The result channel variable in main.nf was renamed alongside the workflow so the public entrypoint and immediate call site use the same NVD main-workflow terminology. Nearby comments and the terminal completion message were also updated to avoid preserving the old STAT+BLAST branding in active workflow source.
nrminor added 4 commits May 15, 2026 11:29
The rendered Nextflow config assertion no longer expects the removed public tools selector now that v3 has a single main workflow. The check still verifies the stable top-level parameters that should always be present.

Slack notification tests now avoid lint noise while keeping explicitly fake Slack-shaped token literals in source for WebClient wiring coverage. The fake tokens are centralized with S105 suppressions and comments so security tooling has a local explanation rather than repeated token-looking literals.
The extract_taxa_spots.py and hits_to_report.py scripts were intentionally retired with the STAT contig classification chain, so their characterization tests now only preserved a dead public surface. Removing those tests reduces the maintenance burden instead of resurrecting obsolete scripts.

The remaining BLAST taxonomy script tests now patch taxonomy._ensure_taxdump with a helper that accepts the current state_dir/taxonomy_dir call shape while still routing taxonomy.open() to the local fixture database.
Write and query hit observations under hits/schema=v3 so the v3 BLAST-only parquet store cannot be silently mixed with older v2 parquet files. This makes the break explicit while leaving old files on disk for manual recovery if needed.

With the store versioned by path, remove the old contig_id/schema-evolution compatibility layer from the DuckDB view, parquet writers, compaction normalization, and tests. The v3 schema now uses sequence_id directly and fails loudly if required canonical columns are missing.
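
A sketch of the two ideas above: versioning the store by path and failing loudly on missing columns. The hits/schema=v3 layout and the sequence_id column come from the commit message; the sample_id column, the function names, and the choice of ValueError are assumptions.

```python
from pathlib import Path

SCHEMA_VERSION = "v3"
# sequence_id is named in the commit; sample_id is a hypothetical stand-in
# for the other canonical columns.
REQUIRED_COLUMNS = ("sample_id", "sequence_id")


def hits_root(state_dir):
    """Directory holding v3 hit observations, kept apart from v2 files."""
    return Path(state_dir) / "hits" / f"schema={SCHEMA_VERSION}"


def require_columns(columns):
    """Raise if any required canonical column is absent."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
```

A v2-era table that still uses contig_id fails the check instead of being silently coerced, which matches the "fail loudly" behaviour described above.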

Also remove remaining STAT database argument wiring from Nextflow hit/LabKey registration paths and refresh comments so REGISTER_HITS describes the parquet observation store rather than implying hits live in the SQLite state database.
@nrminor nrminor force-pushed the v3.0.0-rc branch 2 times, most recently from 8fcd1fd to 72e27ce Compare May 16, 2026 02:07
Importing py_nvd.cli in the container image should not require a bundled Nextflow checkout. Make the module-level PIPELINE_ROOT best-effort so CLI imports succeed in image smoke tests, while commands that actually launch the pipeline still resolve the root lazily and fail with the existing actionable error if no root is available.

Explicitly invalid NVD_PIPELINE_ROOT values continue to fail fast rather than falling back silently. The version command now reports an unavailable pipeline root instead of making import-time root discovery a hard dependency.
2 participants