
Add configuration system, documentation, VAMB compatibility layer, and test suite #9

Draft
Copilot wants to merge 7 commits into main from
copilot/improve-pipeline-documentation


Copilot AI commented Jan 28, 2026

The pipeline lacks standardized configuration and step-by-step documentation, and it fails on VAMB version mismatches. As a result, users struggle with setup and debugging across HPC systems.

Configuration System

  • config/paths.env.example: Shell-sourceable template for paths, databases, resources
  • config/samples.tsv.example: Tab-delimited sample sheet supporting paired/single-end, absolute paths
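
As a rough illustration of how a script might consume the sample sheet, the sketch below reads a tab-delimited file and labels each row as paired or single-end. The column order (sample_id, r1, r2, then the optional host_ref, group, notes) follows the original prompt; the function name classify_samples is hypothetical and not taken from the PR. It uses awk rather than a `read` loop because tab is an IFS whitespace character, so `read` would collapse empty optional columns.

```shell
# Hypothetical helper: print "<sample_id><TAB><paired|single>" for each row.
# Assumed column order: sample_id, r1, r2, host_ref, group, notes.
classify_samples() {
    awk -F '\t' '
        NR == 1  { next }    # skip the header row
        $1 == "" { next }    # skip blank lines
        {
            # An empty or missing r2 column means a single-end sample.
            layout = ($3 == "") ? "single" : "paired"
            print $1 "\t" layout
        }
    ' "$1"
}

# Demo with placeholder paths only.
demo_tsv=$(mktemp)
printf 'sample_id\tr1\tr2\thost_ref\tgroup\tnotes\n' > "$demo_tsv"
printf 'S1\t/path/to/S1_R1.fastq.gz\t/path/to/S1_R2.fastq.gz\t\t\t\n' >> "$demo_tsv"
printf 'S2\t/path/to/S2.fastq.gz\t\t\tcontrol\t\n' >> "$demo_tsv"
classify_samples "$demo_tsv"
rm -f "$demo_tsv"
```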

Documentation

  • docs/directory_contract.md: I/O specifications per step (inputs, outputs, validation, resources, failure modes)
  • docs/known_good_versions.md: Tested tool versions, database tracking, reproducibility baseline
  • docs/compat.md: VAMB version breaks, MMseqs2 DB formats, HPC filesystem differences
  • docs/dev_notes.md: Shellcheck patterns, code standards, testing guidance

Shared Helper Library

lib/common.sh provides POSIX-compatible functions:

  • Validation: require_cmd, require_file, require_dir, check_nonempty
  • Logging: log(), die(), timestamped output
  • HPC portability: get_array_task_id(), get_fast_storage() (Slurm/PBS/system detection)
  • Config loading with helpful error messages

Example usage:

source "${SCRIPT_DIR}/../lib/common.sh"
init_script
require_cmd fastp minimap2 samtools
require_file "${input}.fastq.gz" || die "Input not found"
log "Processing ${sample}"

See pawsey_shortread/fastp_enhanced.slurm for complete example.
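
For readers who want a feel for what such helpers involve, here is a minimal sketch of a few of them. It is an assumption-laden illustration, not the contents of lib/common.sh: the message formats, the choice to log to stderr, and the exit code are all guesses.

```shell
# Minimal sketch of the helpers; the real lib/common.sh may differ.

log() {
    # Timestamped message on stderr so stdout stays free for data (assumed).
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >&2
}

die() {
    # Report the error and stop with a non-zero status.
    log "ERROR: $*"
    exit 1
}

require_cmd() {
    # Accepts one or more command names, as in the usage example above.
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || die "Required command not found: $cmd"
    done
}

require_file() {
    # The file must exist as a regular file and be readable.
    [ -f "$1" ] && [ -r "$1" ] || die "Required file missing or unreadable: $1"
}

check_nonempty() {
    # -s is true only when the file exists and has size > 0.
    [ -s "$1" ] || die "Expected non-empty output: $1"
}
```

In this sketch require_file exits by itself, so a trailing `|| die` would only take effect if the real function returns a status instead of exiting.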

VAMB Compatibility

Problem: vamb_create_fasta.py imports vamb module → breaks on API changes between v3.x and v4.x

Solution:

  • vamb_create_fasta_clusters.py (canonical): Reads clusters.tsv directly, no vamb import, works across all versions
  • vamb_create_fasta.py: Deprecated with version detection, actionable error messages

Both scripts are enhanced with input validation, progress logging, and clearer argument names.
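
The reason a direct read works across versions is that the cluster assignments are plain text, so nothing from the vamb package is needed to interpret them. As a loose shell illustration (the PR's script is Python, and this is not it), the sketch below counts contigs per cluster. It assumes a two-column TSV of cluster name and contig name, and skips a first line beginning with "clustername" if one is present; check both assumptions against your own VAMB output. The function name cluster_sizes is hypothetical.

```shell
# Hypothetical sanity check: contigs per cluster from a two-column TSV.
# Assumed layout: <cluster name><TAB><contig name>, optional header line.
cluster_sizes() {
    awk -F '\t' '
        NR == 1 && $1 == "clustername" { next }   # skip header if present
        NF >= 2 { count[$1]++ }                   # tally contigs per cluster
        END { for (c in count) print c "\t" count[c] }
    ' "$1" | sort
}
```

A size filter like the one the canonical script offers could then be approximated by piping this output through a second awk that keeps only clusters above a chosen threshold.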

Test Suite

61 test cases across 3 suites, plus a runner, documentation, and CI workflow:

  • tests/test_common.sh: 25 tests for all lib/common.sh functions (logging, validation, file processing, HPC helpers, config loading)
  • tests/test_vamb_create_fasta_clusters.py: 17 tests for VAMB binning script (basic functionality, size filtering, error handling)
  • tests/test_config_files.sh: 19 tests for configuration file validation (syntax, required variables, format)
  • tests/run_tests.sh: Main test runner with colored output
  • tests/README.md: Complete test suite documentation with usage examples
  • .github/workflows/test.yml: GitHub Actions CI workflow for automated testing

All tests passing ✅. Tests run in isolated environments and need no external dependencies beyond standard tools.

Changes Summary

  • 19 new files, 3 modified
  • +3,925 lines (2,106 documentation, 1,328 tests, 491 code)
  • Backward compatible: existing scripts unchanged, DEFINITIONS.sh still supported
  • Updated README with quick start, documentation links
  • Automated CI testing via GitHub Actions
Original prompt

Copilot Instructions — atavide_lite

You are GitHub Copilot working in the atavide_lite repository. Your goal is to make the pipeline easier to run, easier to debug, and more reproducible across HPC systems, without converting it to Snakemake and without large architectural rewrites.

Guiding principles

  1. Do not redesign the pipeline (no Snakemake/Nextflow rewrite).
  2. Prefer small, composable changes that improve:
    • reproducibility (version/DB pinning)
    • portability (shared config format)
    • clarity (directory contracts / expected outputs)
    • robustness (fail-fast checks, helpful errors)
  3. Keep cluster-specific scripts, but standardize their interface via shared config and consistent conventions.
  4. Make changes that are friendly to HPC usage:
    • Bash + Slurm
    • environment modules and/or conda/venv
    • Apptainer/Singularity optional, not mandatory

Phase 1 — Documentation + “front door” config (highest priority)

A) Add a single canonical configuration format

Create a new directory:

  • config/
    • paths.env.example — shell KEY=VALUE entries that can be source’d
    • samples.tsv.example — example samplesheet for paired and single-end runs

Requirements for paths.env.example:

  • must include placeholders for:
    • scratch/work directory root
    • input fastq directory (or allow per-sample absolute paths in samples.tsv)
    • host reference (if used)
    • mmseqs databases (UniRef or other)
    • BV-BRC mapping files/snapshots (if used)
    • threads, memory defaults, tmp directory
  • include comments explaining each variable
  • do not include real paths; use obvious placeholders
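
A template meeting these requirements might look something like the fragment below. Every variable name and value here is a hypothetical placeholder chosen for illustration; the names used in the actual config/paths.env.example may differ.

```shell
# Hypothetical paths.env sketch; all names and values are placeholders.

# Root of the scratch/work area for intermediate files
WORK_DIR="/path/to/scratch/work"

# Directory containing input FASTQ files
# (alternatively, give absolute per-sample paths in samples.tsv)
FASTQ_DIR="/path/to/fastq"

# Host reference for read filtering (leave empty if unused)
HOST_REF="/path/to/host_reference.fasta"

# MMseqs2 database (UniRef or other)
MMSEQS_DB="/path/to/mmseqs/database"

# BV-BRC mapping files/snapshot (leave empty if unused)
BVBRC_DIR="/path/to/bvbrc_snapshot"

# Resource defaults
THREADS=8
MEMORY_GB=32
TMP_DIR="/path/to/tmp"
```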

Requirements for samples.tsv.example:

  • tab-delimited
  • include a header row
  • include both examples:
    • paired-end: sample_id, r1, r2
    • single-end: sample_id, r1
  • allow absolute paths
  • include optional columns (can be empty): host_ref, group, notes

B) Add “directory contract” documentation

Create docs/directory_contract.md documenting, for each step:

  • required inputs
  • outputs (filenames + directory layout)
  • how to validate success (e.g., files exist + non-empty + key metrics file)
  • typical resource notes (threads/mem)
  • failure symptoms and where to look (stderr, logs)

Keep it concise but explicit.

C) Add “known-good versions” documentation

Create docs/known_good_versions.md with:

  • a table listing the major tools and tested versions:
    • fastp
    • minimap2
    • samtools
    • mmseqs2
    • megahit
    • vamb
    • python version (for helper scripts)
  • a section for database snapshots (names + how to record version/date)

Include a short note:

  • this is not strict, but serves as a tested baseline for reproducibility.

D) Update README to point to the above

Update README.md:

  • add a “Quick start (conceptual)” section with:
    1. copy example config files
    2. edit paths
    3. run the appropriate cluster script
  • link to:
    • docs/directory_contract.md
    • docs/known_good_versions.md
    • config examples

Phase 2 — Standardize script interfaces (medium priority)

A) Add a shared Bash helper: lib/common.sh

Create lib/common.sh and ensure cluster scripts can source it.

It must provide:

  • die() — print error to stderr and exit non-zero
  • log() — timestamped logging
  • require_cmd() — assert a command exists in PATH
  • require_file() — assert a file exists and is readable
  • require_dir() — assert a directory exists (or create it if appropriate)
  • check_nonempty() — assert output exists and size > 0
  • load_config() — source config/paths.env if present, else error with guidance

Do not introduce external dependencies beyond POSIX/Bash.

B) Add consistent safe Bash practices in scripts

Where appropriate:

  • set -euo pipefail
  • trap with line number and command (trap '...' ERR)
  • quote variables
  • use find ... -print0 + while IFS= read -r -d '' patterns for filenames

Do not rewrite every script at once; start with 1–2 representative ones.
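
The practices above could be combined roughly as follows. This is a sketch that requires Bash rather than plain sh; the helper count_matching_files is hypothetical and exists only to show the null-delimited find loop.

```shell
#!/usr/bin/env bash
# Sketch of the safe-Bash practices; requires Bash (not plain sh).
set -euo pipefail

# Report the failing line and command before the script exits.
# Add `set -E` if the trap should also fire inside functions.
trap 'echo "ERROR at line ${LINENO}: ${BASH_COMMAND}" >&2' ERR

# Hypothetical helper: count files matching a pattern under a directory,
# handling spaces and newlines in filenames safely.
count_matching_files() {
    local dir=$1 pattern=$2 n=0 path
    while IFS= read -r -d '' path; do
        # Arithmetic expansion is used instead of ((n++)), which returns
        # a non-zero status when n is 0 and would trip `set -e`.
        n=$((n + 1))
    done < <(find "$dir" -type f -name "$pattern" -print0)
    echo "$n"
}
```

For example, `count_matching_files "$FASTQ_DIR" '*.fastq.gz'` would print the number of matching files, where FASTQ_DIR stands in for whatever variable holds the input directory.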


Phase 3 — Fix/contain VAMB script drift (high priority because it breaks runs)

A) Establish a single canonical VAMB helper script

Identify the two similar scripts (e.g., vamb_create_fasta.py and vamb_create_fasta_clusters.py).

Actions:

  1. Pick one as canonical (prefer the more general one).
  2. Deprecate the other:
    • keep it but add a top-of-file comment: “DEPRECATED — use ”
    • or replace it with a small wrapper calling the canonical script.

B) Add a compatibility layer for VAMB version differences

In the canonical script:

  • detect VAMB version using one of:
    • import vamb; vamb.__version__ (preferred if available)
    • python -c "import vamb; print(vamb.__version__)" in wrappers
  • adapt to known argument changes (e.g., minsize vs removed/renamed args)
  • if version detection fails, fail fast with a helpful message:
    • “Tested with VAMB X.Y.Z; please install that or update compat logic.”
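
For the wrapper route, the detection and fail-fast steps might be sketched as below. The function names are hypothetical, and treating major versions 3 and 4 as the tested ones is an assumption made for illustration. Because some releases may expose the version as a tuple rather than a dotted string, the parser simply takes the first run of digits.

```shell
# Hypothetical wrapper helpers; names and accepted versions are illustrative.

# Print the installed VAMB version, or return non-zero if it cannot be found.
detect_vamb_version() {
    python -c 'import vamb; print(vamb.__version__)' 2>/dev/null
}

# Extract the major version: the first run of digits in the string.
# Prints nothing if the string contains no digits.
vamb_major() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
}

# Fail fast with guidance when detection fails or the version is untested.
check_vamb() {
    version=$(detect_vamb_version) || {
        echo "Could not detect VAMB; install a tested version or update the compat logic." >&2
        return 1
    }
    case "$(vamb_major "$version")" in
        3|4) echo "VAMB ${version} detected" ;;   # assumed tested majors
        *)   echo "Untested VAMB version: ${version}" >&2
             return 1 ;;
    esac
}
```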

Add a short doc: docs/compat.md capturi...



Copilot AI and others added 5 commits January 28, 2026 05:17
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve pipeline documentation and configuration setup" to "Add configuration system, documentation, and VAMB compatibility layer" Jan 28, 2026
Copilot AI requested a review from linsalrob January 28, 2026 05:22
@linsalrob
Owner

@copilot can you design a test suite for this code?

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Author

Copilot AI commented Jan 28, 2026

@copilot can you design a test suite for this code?

I've designed and implemented a comprehensive test suite for the pipeline improvements. Commit: 3e9201b

Test Suite Overview

61 test cases across 3 test suites, all passing ✅

1. Bash Helper Tests (tests/test_common.sh)

Tests all 25+ functions in lib/common.sh:

  • Logging (log, log_error, log_warn, die)
  • Validation (require_cmd, require_file, require_dir, check_nonempty, require_var)
  • File processing (count_fastq_reads, count_fasta_sequences, file_size)
  • HPC helpers (get_array_task_id, detect_scheduler, get_fast_storage)
  • Config loading and output functions
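
To give a sense of how an HPC helper and its test might fit together, here is a sketch of a scheduler-aware array index lookup with one test case. The environment variables are the ones set by Slurm (SLURM_ARRAY_TASK_ID), PBS Professional (PBS_ARRAY_INDEX), and Torque (PBS_ARRAYID). The fallback behaviour and the test function name are assumptions, and the real get_array_task_id may differ.

```shell
# Sketch of scheduler-aware array index lookup; the real function may differ.
get_array_task_id() {
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo "$SLURM_ARRAY_TASK_ID"      # Slurm job arrays
    elif [ -n "${PBS_ARRAY_INDEX:-}" ]; then
        echo "$PBS_ARRAY_INDEX"          # PBS Professional job arrays
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        echo "$PBS_ARRAYID"              # Torque job arrays
    else
        echo "${1:-1}"                   # assumed fallback: caller default, else 1
    fi
}

# Hypothetical test case. The command substitution runs in a subshell,
# so the caller's environment is left untouched.
test_get_array_task_id_slurm() {
    result=$(unset PBS_ARRAY_INDEX PBS_ARRAYID; SLURM_ARRAY_TASK_ID=7 get_array_task_id)
    [ "$result" = "7" ]
}
```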

2. VAMB Script Tests (tests/test_vamb_create_fasta_clusters.py)

Tests the canonical VAMB binning script (17 tests):

  • Basic clustering, size filtering, verbose mode
  • Error handling for missing/malformed inputs
  • Output validation (compressed FASTA files)

3. Config File Tests (tests/test_config_files.sh)

Validates configuration templates (19 tests):

  • Syntax checking (sourceable without errors)
  • Required variables present
  • Tab-delimited format validation
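
Checks of this kind can be quite small. The sketch below shows one plausible shape for a syntax check and a tab-delimited format check; the function names are hypothetical and are not taken from tests/test_config_files.sh.

```shell
# Hypothetical checks in the spirit of the config file tests.

# Pass if the file can be sourced without errors. The subshell keeps
# any variables it defines out of the calling environment.
config_is_sourceable() {
    ( . "$1" ) >/dev/null 2>&1
}

# Pass if every non-empty line has the same number of tab-separated
# fields as the header row.
tsv_is_consistent() {
    awk -F '\t' '
        NR == 1 { n = NF; next }         # remember the header width
        NF > 0 && NF != n { bad = 1 }    # flag rows of a different width
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Note that this strict width check would reject rows that omit trailing tabs for empty optional columns; a more lenient test could instead require only a minimum number of fields.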

Running Tests

cd tests
./run_tests.sh              # Run all tests
./run_tests.sh --verbose    # Detailed output

Individual suites can be run separately. See tests/README.md for details.

CI Integration

Added .github/workflows/test.yml for automated testing on push/PR.

Copilot AI changed the title from "Add configuration system, documentation, and VAMB compatibility layer" to "Add configuration system, documentation, VAMB compatibility layer, and test suite" Jan 28, 2026