Add configuration system, documentation, VAMB compatibility layer, and test suite by Copilot · Pull Request #9 · linsalrob/atavide_lite

Copilot · 2026-01-28T05:11:46Z

The pipeline lacks standardized configuration and step-by-step documentation, and it fails on VAMB version mismatches. As a result, users struggle with setup and debugging across HPC systems.

Configuration System

config/paths.env.example: Shell-sourceable template for paths, databases, resources
config/samples.tsv.example: Tab-delimited sample sheet supporting paired/single-end, absolute paths

Documentation

docs/directory_contract.md: I/O specifications per step (inputs, outputs, validation, resources, failure modes)
docs/known_good_versions.md: Tested tool versions, database tracking, reproducibility baseline
docs/compat.md: VAMB version breaks, MMseqs2 DB formats, HPC filesystem differences
docs/dev_notes.md: Shellcheck patterns, code standards, testing guidance

Shared Helper Library

lib/common.sh provides POSIX-compatible functions:

Validation: require_cmd, require_file, require_dir, check_nonempty
Logging: log(), die(), timestamped output
HPC portability: get_array_task_id(), get_fast_storage() (Slurm/PBS/system detection)
Config loading with helpful error messages

Example usage:

source "${SCRIPT_DIR}/../lib/common.sh"
init_script
require_cmd fastp minimap2 samtools
require_file "${input}.fastq.gz" || die "Input not found"
log "Processing ${sample}"

See pawsey_shortread/fastp_enhanced.slurm for complete example.
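
The HPC portability helpers listed above could look roughly like the sketch below. This is an illustration, not the code in lib/common.sh: the function bodies and the fallback value are assumptions. The environment variables are the ones Slurm (SLURM_JOB_ID, SLURM_ARRAY_TASK_ID), PBS Professional (PBS_JOBID, PBS_ARRAY_INDEX), and Torque (PBS_ARRAYID) set inside a job.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real helpers in lib/common.sh may differ.

# Guess the scheduler from variables that are set inside a running job.
detect_scheduler() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "slurm"
    elif [ -n "${PBS_JOBID:-}" ]; then
        echo "pbs"
    else
        echo "none"
    fi
}

# Print the array index of the current task.
# Falls back to the first argument (default 1) outside an array job.
get_array_task_id() {
    local fallback="${1:-1}"
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo "$SLURM_ARRAY_TASK_ID"
    elif [ -n "${PBS_ARRAY_INDEX:-}" ]; then
        echo "$PBS_ARRAY_INDEX"      # PBS Professional
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        echo "$PBS_ARRAYID"          # Torque
    else
        echo "$fallback"
    fi
}
```

With a fallback in place, the same script can be run interactively on a login node for debugging and as an array task without changes.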

VAMB Compatibility

Problem: vamb_create_fasta.py imports vamb module → breaks on API changes between v3.x and v4.x

Solution:

vamb_create_fasta_clusters.py (canonical): Reads clusters.tsv directly, no vamb import, works across all versions
vamb_create_fasta.py: Deprecated with version detection, actionable error messages

Both scripts enhanced with input validation, progress logging, better arg names.
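
To illustrate why reading clusters.tsv directly avoids the import problem, here is a rough shell sketch that summarizes cluster sizes with awk alone. It assumes a two-column, tab-separated file (cluster name, then contig name) with an optional header whose first field is `clustername`; that layout is an assumption and should be checked against the VAMB release in use. The function name and the minimum-size argument are hypothetical and are not the interface of vamb_create_fasta_clusters.py.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: count contigs per cluster without importing vamb.
# Assumed input: TSV with columns <cluster> <contig>, optional header row.

# count_cluster_sizes CLUSTERS_TSV [MIN_CONTIGS]
# Prints "<cluster>\t<n_contigs>" for clusters with at least MIN_CONTIGS contigs.
count_cluster_sizes() {
    awk -F'\t' -v min="${2:-1}" '
        NR == 1 && $1 == "clustername" { next }   # skip header if present
        NF >= 2 { n[$1]++ }                        # tally contigs per cluster
        END { for (c in n) if (n[c] >= min) print c "\t" n[c] }
    ' "$1" | sort
}
```

Because only the text file is parsed, nothing here depends on the Python API of any particular VAMB version.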

Test Suite

The test suite comprises 61 test cases across 3 suites:

tests/test_common.sh: 25 tests for all lib/common.sh functions (logging, validation, file processing, HPC helpers, config loading)
tests/test_vamb_create_fasta_clusters.py: 17 tests for VAMB binning script (basic functionality, size filtering, error handling)
tests/test_config_files.sh: 19 tests for configuration file validation (syntax, required variables, format)
tests/run_tests.sh: Main test runner with colored output
tests/README.md: Complete test suite documentation with usage examples
.github/workflows/test.yml: GitHub Actions CI workflow for automated testing

All tests passing ✅. Tests use isolated environments, no external dependencies beyond standard tools.

Changes Summary

19 new files, 3 modified
+3,925 lines (2,106 documentation, 1,328 tests, 491 code)
Backward compatible: existing scripts unchanged, DEFINITIONS.sh still supported
Updated README with quick start, documentation links
Automated CI testing via GitHub Actions

Original prompt

Copilot Instructions — atavide_lite

You are GitHub Copilot working in the atavide_lite repository. Your goal is to make the pipeline easier to run, easier to debug, and more reproducible across HPC systems, without converting it to Snakemake and without large architectural rewrites.

Guiding principles

Do not redesign the pipeline (no Snakemake/Nextflow rewrite).
Prefer small, composable changes that improve:
- reproducibility (version/DB pinning)
- portability (shared config format)
- clarity (directory contracts / expected outputs)
- robustness (fail-fast checks, helpful errors)
Keep cluster-specific scripts, but standardize their interface via shared config and consistent conventions.
Make changes that are friendly to HPC usage:
- Bash + Slurm
- environment modules and/or conda/venv
- Apptainer/Singularity optional, not mandatory

Phase 1 — Documentation + “front door” config (highest priority)

A) Add a single canonical configuration format

Create a new directory:

config/
- paths.env.example — shell KEY=VALUE entries that can be source’d
- samples.tsv.example — example samplesheet for paired and single-end runs

Requirements for paths.env.example:

must include placeholders for:
- scratch/work directory root
- input fastq directory (or allow per-sample absolute paths in samples.tsv)
- host reference (if used)
- mmseqs databases (UniRef or other)
- BV-BRC mapping files/snapshots (if used)
- threads, memory defaults, tmp directory
include comments explaining each variable
do not include real paths; use obvious placeholders
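
A minimal sketch of how a script might load such a file, assuming it contains plain KEY=VALUE lines. The function body is illustrative and is not taken from lib/common.sh. Using `set -a` is an added choice so that the sourced variables are also exported to child processes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a config loader; the real load_config() may differ.

# load_config [FILE]: source a KEY=VALUE file, or fail with setup guidance.
load_config() {
    local config_file="${1:-config/paths.env}"
    if [ ! -r "$config_file" ]; then
        echo "ERROR: config file not found or unreadable: $config_file" >&2
        echo "Hint: cp config/paths.env.example config/paths.env, then edit the placeholders" >&2
        return 1
    fi
    set -a                      # export every variable assigned while sourcing
    # shellcheck disable=SC1090
    . "$config_file" || { set +a; return 1; }
    set +a
}
```

A caller would typically run `load_config || exit 1` near the top of a cluster script, before any path variables are used.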

Requirements for samples.tsv.example:

tab-delimited
include a header row
include both examples:
- paired-end: sample_id, r1, r2
- single-end: sample_id, r1
allow absolute paths
include optional columns (can be empty): host_ref, group, notes
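
One pitfall with a sheet like this is that Bash treats tab as IFS whitespace, so `IFS=$'\t' read` collapses consecutive tabs and empty columns shift left. The sketch below works around that by swapping tabs for a non-whitespace separator before splitting. The column order follows the list above; the function name and its output format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parse a samples.tsv-style sheet, keeping empty columns.

# parse_samples FILE: print "<sample_id>\t<layout>\t<host_ref or ->" per row.
parse_samples() {
    local sheet="$1" line sample_id r1 r2 host_ref group notes layout
    local tab=$'\t'
    local sep=$'\x1f'    # ASCII unit separator: not IFS whitespace
    {
        IFS= read -r line || true            # skip the header row
        while IFS= read -r line || [ -n "$line" ]; do
            [ -z "$line" ] && continue       # ignore blank lines
            # Replace tabs first so that empty fields survive the split.
            IFS="$sep" read -r sample_id r1 r2 host_ref group notes \
                <<< "${line//$tab/$sep}"
            if [ -n "$r2" ]; then layout="paired"; else layout="single"; fi
            printf '%s\t%s\t%s\n' "$sample_id" "$layout" "${host_ref:--}"
        done
    } < "$sheet"
}
```

A single-end row with an empty r2 and a filled host_ref is then reported as `single` with the correct host reference, rather than having host_ref read as r2.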

B) Add “directory contract” documentation

Create docs/directory_contract.md documenting, for each step:

required inputs
outputs (filenames + directory layout)
how to validate success (e.g., files exist + non-empty + key metrics file)
typical resource notes (threads/mem)
failure symptoms and where to look (stderr, logs)

Keep it concise but explicit.
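
The "files exist + non-empty" check could be sketched as below. Unlike a fail-fast check, this version reports every missing or empty output before returning, which may make a failed step easier to diagnose. The function name is hypothetical and does not come from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: validate that a step produced all expected outputs.

# validate_outputs FILE...: report each missing or empty file on stderr.
# Returns 0 only if every file exists and has size > 0.
validate_outputs() {
    local f status=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING: $f" >&2
            status=1
        elif [ ! -s "$f" ]; then
            echo "EMPTY: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```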

C) Add “known-good versions” documentation

Create docs/known_good_versions.md with:

a table listing the major tools and tested versions:
- fastp
- minimap2
- samtools
- mmseqs2
- megahit
- vamb
- python version (for helper scripts)
a section for database snapshots (names + how to record version/date)

Include a short note:

this is not strict, but serves as a tested baseline for reproducibility.

D) Update README to point to the above

Update README.md:

add a “Quick start (conceptual)” section with:
1. copy example config files
2. edit paths
3. run the appropriate cluster script
link to:
- docs/directory_contract.md
- docs/known_good_versions.md
- config examples

Phase 2 — Standardize script interfaces (medium priority)

A) Add a shared Bash helper: `lib/common.sh`

Create lib/common.sh and ensure cluster scripts can source it.

It must provide:

die() — print error to stderr and exit non-zero
log() — timestamped logging
require_cmd() — assert a command exists in PATH
require_file() — assert a file exists and is readable
require_dir() — assert a directory exists (or create it if appropriate)
check_nonempty() — assert output exists and size > 0
load_config() — source config/paths.env if present, else error with guidance

Do not introduce external dependencies beyond POSIX/Bash.
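
As a rough illustration of the helper contract above, the sketch below implements a subset of the listed functions using only Bash builtins and `date`. It shows one way to meet the stated behavior and is not the contents of lib/common.sh; the message wording is invented.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a few helpers; not the repository's lib/common.sh.

# log MESSAGE...: timestamped message on stderr.
log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >&2; }

# die MESSAGE...: log an error and exit non-zero.
die() { log "ERROR: $*"; exit 1; }

# require_cmd CMD...: assert that every command is available in PATH.
require_cmd() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || die "Required command not found in PATH: $cmd"
    done
}

# require_file FILE: assert that the file exists and is readable.
require_file() {
    { [ -f "$1" ] && [ -r "$1" ]; } || die "Required file missing or unreadable: $1"
}

# check_nonempty FILE: assert that the output exists and has size > 0.
check_nonempty() {
    [ -s "$1" ] || die "Output missing or empty: $1"
}
```

Because `die` calls `exit`, these helpers stop the calling script at the first failed check, which matches the fail-fast goal in the guiding principles.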

B) Add consistent safe Bash practices in scripts

Where appropriate:

set -euo pipefail
trap with line number and command (trap '...' ERR)
quote variables
use find ... -print0 + while IFS= read -r -d '' patterns for filenames

Do not rewrite every script at once; start with 1–2 representative ones.
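
The practices listed above can be combined in a short script header, sketched here. The `set -o errtrace` line is an addition not named in the list: without it, Bash does not run the ERR trap for failures inside functions. `sort -z` is available in GNU and BSD sort and is used only to make the order deterministic. The function name and the `*.fastq.gz` pattern are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a safe script header and a NUL-delimited file loop.
set -euo pipefail
set -o errtrace   # added assumption: lets the ERR trap fire inside functions
trap 'echo "ERROR: line ${LINENO}: ${BASH_COMMAND}" >&2' ERR

# list_fastq DIR: print each *.fastq.gz path under DIR, one per line.
# NUL delimiters keep filenames with spaces or newlines intact while looping.
list_fastq() {
    local dir="$1" f
    while IFS= read -r -d '' f; do
        printf '%s\n' "$f"
    done < <(find "$dir" -type f -name '*.fastq.gz' -print0 | sort -z)
}
```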

Phase 3 — Fix/contain VAMB script drift (high priority because it breaks runs)

A) Establish a single canonical VAMB helper script

Identify the two similar scripts (e.g., vamb_create_fasta.py and vamb_create_fasta_clusters.py).

Actions:

Pick one as canonical (prefer the more general one).
Deprecate the other:
- keep it but add a top-of-file comment: “DEPRECATED — use ”
- or replace it with a small wrapper calling the canonical script.

B) Add a compatibility layer for VAMB version differences

In the canonical script:

detect VAMB version using one of:
- import vamb; vamb.__version__ (preferred if available)
- python -c "import vamb; print(vamb.__version__)" in wrappers
adapt to known argument changes (e.g., minsize vs removed/renamed args)
if version detection fails, fail fast with a helpful message:
- “Tested with VAMB X.Y.Z; please install that or update compat logic.”
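
For the wrapper route, version detection with a fail-fast message might be sketched as follows. The function names are hypothetical. The exact format of `vamb.__version__` is not confirmed here (it may be a string or a tuple), so the parser simply takes the first integer it finds.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of VAMB version detection for a shell wrapper.

# version_major STRING: print the first integer in a version string,
# e.g. "4.1.3" or "(4, 1, 3)" both give 4.
version_major() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# detect_vamb_version [PYTHON]: print vamb.__version__, or fail with guidance.
detect_vamb_version() {
    local py="${1:-python}" version
    if ! version="$("$py" -c 'import vamb; print(vamb.__version__)' 2>/dev/null)"; then
        echo "ERROR: could not import vamb using '$py'." >&2
        echo "Install a tested VAMB release (see docs/known_good_versions.md) or update the compat logic." >&2
        return 1
    fi
    printf '%s\n' "$version"
}
```

A wrapper could then branch on `version_major "$(detect_vamb_version)"` to choose between argument sets.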

Add a short doc: docs/compat.md capturi...


Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

linsalrob · 2026-01-28T05:27:36Z

@copilot can you design a test suite for this code?

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Copilot · 2026-01-28T05:33:20Z

@copilot can you design a test suite for this code?

I've designed and implemented a comprehensive test suite for the pipeline improvements. Commit: 3e9201b

Test Suite Overview

61 test cases across 3 test suites, all passing ✅

1. Bash Helper Tests (`tests/test_common.sh`)

Tests all 25+ functions in lib/common.sh:

Logging (log, log_error, log_warn, die)
Validation (require_cmd, require_file, require_dir, check_nonempty, require_var)
File processing (count_fastq_reads, count_fasta_sequences, file_size)
HPC helpers (get_array_task_id, detect_scheduler, get_fast_storage)
Config loading and output functions

2. VAMB Script Tests (`tests/test_vamb_create_fasta_clusters.py`)

Tests the canonical VAMB binning script (17 tests):

Basic clustering, size filtering, verbose mode
Error handling for missing/malformed inputs
Output validation (compressed FASTA files)

3. Config File Tests (`tests/test_config_files.sh`)

Validates configuration templates (19 tests):

Syntax checking (sourceable without errors)
Required variables present
Tab-delimited format validation
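
Checks of this kind could be sketched as below: `bash -n` parses a file without executing it, and a small awk program compares each row's field count with the header. Both function names are hypothetical and are not taken from tests/test_config_files.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of config template checks.

# check_sourceable FILE: syntax-check a shell file without running it.
check_sourceable() {
    bash -n "$1"
}

# check_tsv_columns FILE: fail if any non-blank row has a different number
# of tab-separated fields than the header row.
check_tsv_columns() {
    awk -F'\t' '
        NR == 1 { expected = NF; next }
        NF > 0 && NF != expected {
            printf "line %d: expected %d columns, found %d\n", NR, expected, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Note that a row passes the column check only if its empty optional columns are still delimited by tabs.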

Running Tests

cd tests
./run_tests.sh              # Run all tests
./run_tests.sh --verbose    # Detailed output

Individual suites can be run separately. See tests/README.md for details.

CI Integration

Added .github/workflows/test.yml for automated testing on push/PR.

Initial plan

2fd2408

Copilot AI assigned Copilot and linsalrob Jan 28, 2026

Copilot started work on behalf of linsalrob January 28, 2026 05:12

Copilot AI and others added 5 commits January 28, 2026 05:17

Add Phase 1: Configuration files and documentation

050659e

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Add lib/common.sh with shared helper functions

ce5deb7

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Phase 3: Improve VAMB script compatibility and documentation

20e5e84

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Add enhanced example script using lib/common.sh and lib documentation

13de07e

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Add comprehensive implementation summary document

fb08756

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Improve pipeline documentation and configuration setup~~ Add configuration system, documentation, and VAMB compatibility layer Jan 28, 2026

Copilot AI requested a review from linsalrob January 28, 2026 05:22

Copilot finished work on behalf of linsalrob January 28, 2026 05:22

Copilot started work on behalf of linsalrob January 28, 2026 05:27

Add comprehensive test suite for pipeline improvements

3e9201b

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>

Copilot AI changed the title ~~Add configuration system, documentation, and VAMB compatibility layer~~ Add configuration system, documentation, VAMB compatibility layer, and test suite Jan 28, 2026

Copilot finished work on behalf of linsalrob January 28, 2026 05:34

Copilot wants to merge 7 commits into main from copilot/improve-pipeline-documentation
