38 changes: 38 additions & 0 deletions AGENTS.md
@@ -87,6 +87,44 @@ next: richer evidence tables for additional profiles and compare mode
later: MCP/tool-agent interface and optional local summaries
```

## Deep Release Vision

Durable vision document:

```text
docs/vision-plan.md
```

FastaGuard should become the FASTA preflight operating system for modern
bioinformatics pipelines: validate the FASTA, explain red flags, emit a stable
contract, and route to the right downstream tools.

The release strategy is evidence before expansion:

```text
v0.3: evidence pack + assembly gate + provenance checksums
v0.4: compare mode for many FASTA files
v0.5: transcriptome profile
v0.6: protein profile
v0.7: reference-panel profile
later: MCP/tool-agent interface and optional local summaries
```

Default product boundaries:

- stay fast and database-free by default
- keep JSON as the source of truth
- keep HTML as a human view
- make findings machine-actionable with stable IDs, severity, evidence, thresholds, actions, and scope
- keep optional generated summaries local-metrics-only and traceable back to structured fields
- never claim to replace QUAST, BUSCO, BlobToolKit, CheckM, seqkit, MultiQC, or annotation workflows
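
The "machine-actionable findings" boundary can be pictured as a small record type. The sketch below is illustrative only: the field names mirror the list above, `high_n_rate` is a finding ID that appears in the README, and every value, the `Finding` class, and the `finding_to_json` helper are assumptions rather than FastaGuard's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape for one finding; not FastaGuard's real output contract."""

    id: str          # stable identifier, e.g. "high_n_rate"
    severity: str    # assumed vocabulary such as "info", "warn", "fail"
    evidence: dict   # the measured values that triggered the finding
    threshold: float # the limit the evidence was compared against
    actions: list    # suggested next steps for a pipeline or a human
    scope: str       # assumed values such as "record" or "file"


def finding_to_json(finding: Finding) -> str:
    """Serialize with sorted keys so repeated runs produce identical text."""
    return json.dumps(asdict(finding), sort_keys=True)


example = Finding(
    id="high_n_rate",
    severity="warn",
    evidence={"n_fraction": 0.12},  # illustrative number, not a real measurement
    threshold=0.05,                 # illustrative threshold
    actions=["inspect_gap_regions"],
    scope="record",
)
print(finding_to_json(example))
```

Keeping JSON as the source of truth means an HTML view or a generated summary could be rebuilt from records like this one without rereading the FASTA.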

Recommended next big release:

```text
v0.3 should make FastaGuard credible as the default assembly gate before adding broad new biological profiles.
```

## Collaboration Preference

When moving the project forward, provide a clear recommendation first, then proceed when the user approves or explicitly asks to continue. The default recommendation should favor boring, stable contracts over flashy AI features.
26 changes: 20 additions & 6 deletions README.md
@@ -32,7 +32,8 @@ tar -xzf fastaguard-v0.2.0-x86_64-unknown-linux-gnu.tar.gz
```

The v0.2.0 GitHub release binaries and source archive are published. Bioconda
-may still serve v0.1.1 until the upstream v0.2.0 recipe update is merged.
+serves v0.2.0 for Linux x86_64, Linux ARM64, macOS Intel, and macOS Apple
+Silicon.

Local development build:

@@ -67,7 +68,7 @@ fastaguard --finding-catalog
fastaguard --explain-finding high_n_rate
```

-Build and run the Docker image:
+Build and run the local Docker image:

```bash
docker build -t fastaguard:local .
@@ -79,6 +80,12 @@ docker run --rm -v "$PWD:/data" fastaguard:local /data/sample.fa \
--multiqc /data/fastaguard_mqc.json
```

Use the generated BioContainers image in workflow engines:

```bash
docker pull quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0
```

Exit codes:

```text
@@ -160,12 +167,14 @@ FastaGuard catches FASTA-level assembly problems before expensive assembly QC.

- [Example reports](examples/reports/README.md)
- [Product thesis](docs/product-thesis.md)
- [Vision plan](docs/vision-plan.md)
- [MVP spec](docs/mvp-spec.md)
- [Output contract](docs/output-contract.md)
- [Tool landscape](docs/tool-landscape.md)
- [Adoption plan](docs/adoption-plan.md)
- [LLM and tooling vision](docs/llm-tooling-vision.md)
- [Benchmarking](docs/benchmarking.md)
- [v0.2 evidence pack](docs/evidence/fastaguard-v0.2-evidence.md)
- [Packaging](docs/packaging.md)
- [v0.2.0 release notes](docs/releases/v0.2.0.md)
- [v0.1.1 release notes](docs/releases/v0.1.1.md)
@@ -175,7 +184,12 @@ FastaGuard catches FASTA-level assembly problems before expensive assembly QC.

## Status

-v0.2.0 is published on GitHub with Linux and macOS release binaries. The
-Bioconda v0.2.0 recipe metadata is ready with the published source archive SHA;
-Bioconda may still serve v0.1.1 until the upstream recipe update is merged.
-BioContainers image availability is still pending confirmation.
+v0.2.0 is published on GitHub with Linux and macOS release binaries. Bioconda
+serves v0.2.0 for `linux-64`, `linux-aarch64`, `osx-64`, and `osx-arm64`.
+BioContainers also publishes the pinned workflow image
+`quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0`.
+
+The next internal milestone is the
+[v0.2 evidence pack](docs/evidence/fastaguard-v0.2-evidence.md): reproducible
+local and public FASTA runs that document runtime, verdicts, and top findings
+before new biological profiles are added.
14 changes: 7 additions & 7 deletions docs/adoption-plan.md
@@ -8,21 +8,21 @@ adding many new biological heuristics.
Priority:

```text
-Bioconda published -> BioContainers confirmation -> MultiQC plugin -> public benchmarks -> upstream workflow examples
+Bioconda published -> BioContainers available -> MultiQC plugin -> public benchmarks -> upstream workflow examples
```

## Phase 1: Package

Goal: make installation natural for bioinformatics users.

-Status: Bioconda is live for FastaGuard v0.1.1, and the v0.2.0 recipe update
-is ready with the published GitHub source archive SHA.
+Status: Bioconda is live for FastaGuard v0.2.0 on Linux and macOS x86_64/ARM64
+platforms. BioContainers publishes the pinned workflow image
+`quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0`.

- Keep GitHub release binaries working.
- Keep Docker smoke tests passing.
- Keep `packaging/bioconda/` aligned with the upstream Bioconda recipe.
-- Confirm BioContainers image/tag publication after the Bioconda merge;
-current candidate tags are not yet visible in the registry.
+- Keep workflow examples pinned to the confirmed BioContainers image tag.

Done when:

@@ -31,8 +31,8 @@ mamba install -c conda-forge -c bioconda fastaguard
fastaguard --schema
```

-works in a clean environment. This has been verified locally on macOS for
-v0.1.1; repeat this check for v0.2.0 after the Bioconda update merges.
+works in a clean environment, and workflow engines can pull the pinned
+BioContainers image.

## Phase 2: Aggregate

27 changes: 27 additions & 0 deletions docs/benchmarking.md
@@ -117,3 +117,30 @@ For each run, record:
- whether downstream tools would have been blocked or recommended

This evidence matters more than synthetic speed alone because it shows the wedge: cheap FASTA preflight before expensive downstream QC.

## Evidence Pack Workflow

The v0.2 evidence workflow is documented in
`docs/evidence/fastaguard-v0.2-evidence.md`.

CI-safe local run:

```bash
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/local-smoke \
--local-only
```

Public NCBI run:

```bash
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/v0.2
```

The public run uses NCBI Datasets commands such as
`datasets download genome accession <ACCESSION> --include genome --filename <zip>`.
It writes compact `evidence_summary.json` and `evidence_summary.tsv` files while
leaving downloaded FASTA files and full reports under `target/`.
100 changes: 100 additions & 0 deletions docs/evidence/fastaguard-v0.2-evidence.md
@@ -0,0 +1,100 @@
# FastaGuard v0.2 Evidence Pack

This page records the evidence workflow for FastaGuard v0.2. It is intended to
make the preflight claim inspectable before adding new biological profiles.

FastaGuard is a FASTA preflight tool. It is not biological completeness
analysis, not assembly correctness analysis, and not contamination confirmation.
Passing FastaGuard means the FASTA-level contract is sane enough to route into
downstream tools such as QUAST, BUSCO, BlobToolKit, CheckM, or annotation.

## Local Evidence Run

Build the release binary:

```bash
cargo build --release --locked
```

Run the CI-safe local evidence path:

```bash
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/local-smoke \
--local-only
```

Local-only mode does not require network access or the NCBI Datasets CLI. It
runs:

- a deterministic synthetic FASTA
- `testdata/problem_assembly.fa`
- a gzipped copy of `testdata/valid_assembly.fa`
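
How the local inputs could be prepared is sketched below. This is not the code in `scripts/collect_evidence.py`; the helper names, record count, sequence length, and seed are assumptions chosen only to show what "deterministic" and "gzipped copy" mean in practice.

```python
import gzip
import random
import shutil
from pathlib import Path


def write_synthetic_fasta(path: Path, seed: int = 0, records: int = 3, length: int = 120) -> Path:
    """Write a small FASTA whose content depends only on the arguments."""
    rng = random.Random(seed)  # seeded generator: the same seed gives the same sequences
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="\n") as handle:
        for index in range(records):
            sequence = "".join(rng.choice("ACGT") for _ in range(length))
            handle.write(f">synthetic_{index}\n")
            # Wrap sequence lines at 60 characters, a common FASTA convention.
            for start in range(0, length, 60):
                handle.write(sequence[start:start + 60] + "\n")
    return path


def gzip_copy(source: Path, destination: Path) -> Path:
    """Stream a gzip-compressed copy of source to destination."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with source.open("rb") as src, gzip.open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return destination
```

A seeded generator keeps the synthetic case reproducible across machines running the same Python version, which is what makes the smoke run safe for CI.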

## Public NCBI Evidence Run

Install the NCBI Datasets CLI, then run:

```bash
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/v0.2
```

The public workflow downloads genomic FASTA packages with commands shaped like:

```bash
datasets download genome accession GCF_000005845.2 --include genome --filename target/evidence/v0.2/ecoli_k12_mg1655/ncbi_dataset.zip
```

If `datasets` is not installed, the script exits before running public cases.
Use `--local-only` for offline smoke tests.

The default public manifest is:

```text
docs/evidence/public_assemblies.json
```

It currently includes:

- `GCF_000005845.2`: Escherichia coli K-12 MG1655
- `GCF_000182925.2`: Neurospora crassa OR74A
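
As a rough illustration of how the manifest maps to download commands, the sketch below reads the `assemblies` array and builds one argument list per entry in the shape shown above. The function names are hypothetical and this is not the actual logic of `scripts/collect_evidence.py`; only the manifest keys (`id`, `accession`) and the `datasets` arguments come from this page.

```python
import json
import shutil
import sys
from pathlib import Path


def require_datasets_cli() -> None:
    """Exit early when the NCBI Datasets CLI is not on PATH."""
    if shutil.which("datasets") is None:
        sys.exit("datasets CLI not found; install it or use --local-only")


def build_download_commands(manifest_path, out_dir):
    """Return one `datasets download genome accession ...` argv list per assembly."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    commands = []
    for assembly in manifest["assemblies"]:
        zip_path = Path(out_dir) / assembly["id"] / "ncbi_dataset.zip"
        commands.append([
            "datasets", "download", "genome", "accession", assembly["accession"],
            "--include", "genome",
            "--filename", str(zip_path),
        ])
    return commands
```

Each list can be passed to `subprocess.run(command, check=True)` after `require_datasets_cli()` succeeds. Building argument lists instead of shell strings avoids quoting problems in output paths.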

## Outputs

Each case writes FastaGuard artifacts under `target/evidence/<case>/`:

- `fastaguard.json`
- `fastaguard.tsv`
- `fastaguard_report.html`
- `fastaguard_mqc.json`

The workflow also writes compact summaries:

- `evidence_summary.json`
- `evidence_summary.tsv`

The summary records the command used, FastaGuard version, git commit, platform,
date, input size, sequence count, elapsed seconds, verdict, and top findings.
Commit the evidence page and compact summaries when useful. Do not commit
downloaded FASTA files, NCBI zip archives, or full generated reports.
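
One way the two compact summaries could be produced from the same rows is sketched below. The column names are assumptions derived from the prose list above, not the real schema of `evidence_summary.json` or `evidence_summary.tsv`.

```python
import csv
import json
from pathlib import Path

# Hypothetical column names; the real summary files may use different keys.
FIELDS = [
    "case", "command", "fastaguard_version", "git_commit", "platform", "date",
    "input_bytes", "sequence_count", "elapsed_seconds", "verdict", "top_findings",
]


def write_summaries(rows, out_dir):
    """Write the same rows as JSON (nested) and TSV (flattened)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_text = json.dumps(rows, indent=2, sort_keys=True) + "\n"
    (out / "evidence_summary.json").write_text(json_text, encoding="utf-8")
    with (out / "evidence_summary.tsv").open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for row in rows:
            # TSV cells are flat, so the list of finding IDs becomes one comma-joined cell.
            writer.writerow(dict(row, top_findings=",".join(row["top_findings"])))
```

Writing both files from one list keeps the TSV a pure projection of the JSON, in line with treating JSON as the source of truth.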

## Interpretation

Use this evidence to answer practical adoption questions:

- how quickly does FastaGuard produce a preflight report?
- does it catch duplicate IDs, invalid symbols, high-N records, and composition outliers?
- does it produce JSON, TSV, HTML, and MultiQC-ready output before heavier tools run?

Use QUAST, BUSCO, BlobToolKit, CheckM, sourmash, Kraken, or other tools for
deeper biological interpretation after FastaGuard has checked the FASTA-level
contract.
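
To make "FASTA-level contract" concrete, here is a toy version of three of the checks named above: duplicate IDs, invalid symbols, and high-N records. It is a teaching sketch, not FastaGuard's implementation. The nucleotide alphabet is the IUPAC code set, and the 5% N threshold is an arbitrary assumption.

```python
from collections import Counter

# IUPAC nucleotide codes; anything else counts as an invalid symbol in this sketch.
NUCLEOTIDE_SYMBOLS = set("ACGTUNRYKMSWBDHV")


def parse_fasta(text):
    """Return (record_id, sequence) pairs from FASTA text."""
    records, header, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header = parts[0] if parts else ""  # ID is the first token after ">"
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def preflight(records, max_n_fraction=0.05):
    """Flag duplicate IDs, invalid symbols, and records with a high fraction of N."""
    counts = Counter(record_id for record_id, _ in records)
    return {
        "duplicate_ids": sorted(i for i, n in counts.items() if n > 1),
        "invalid_symbols": sorted({i for i, seq in records if set(seq) - NUCLEOTIDE_SYMBOLS}),
        "high_n": sorted({i for i, seq in records
                          if seq and seq.count("N") / len(seq) > max_n_fraction}),
    }
```

None of these checks needs a database or an alignment, which is why they can run before heavier tools.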

## References

- [NCBI Datasets genome download reference](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/download/genome/)
- [NCBI Datasets genome download guide](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/download-genome/)
- [Neurospora crassa OR74A BioProject](https://www.ncbi.nlm.nih.gov/bioproject/132)
19 changes: 19 additions & 0 deletions docs/evidence/public_assemblies.json
@@ -0,0 +1,19 @@
{
"schema_version": 1,
"assemblies": [
{
"id": "ecoli_k12_mg1655",
"accession": "GCF_000005845.2",
"label": "Escherichia coli K-12 MG1655",
"category": "bacterial",
"source_url": "https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000005845.2/"
},
{
"id": "neurospora_crassa_or74a",
"accession": "GCF_000182925.2",
"label": "Neurospora crassa OR74A",
"category": "fungal",
"source_url": "https://www.ncbi.nlm.nih.gov/bioproject/132"
}
]
}
21 changes: 11 additions & 10 deletions docs/packaging.md
@@ -9,10 +9,9 @@ Bioconda -> GitHub release binaries -> Docker image -> BioContainers -> Homebrew
```

FastaGuard v0.2.0 is published on GitHub with Linux and macOS release binaries.
-Bioconda currently publishes v0.1.1 until the upstream v0.2.0 recipe update is
-merged. Docker remains useful for local smoke tests and pipeline containers,
-while BioContainers should be confirmed after the Bioconda publication pipeline
-catches up.
+Bioconda serves v0.2.0 on Linux and macOS x86_64/ARM64 platforms. Docker remains
+useful for local smoke tests, while BioContainers now provides the pinned
+workflow image generated from the Bioconda package.

## Bioconda

@@ -38,7 +37,7 @@ fastaguard --finding-catalog

Current published package:

-- Version: `0.1.1`
+- Version: `0.2.0`
- Platforms: `linux-64`, `linux-aarch64`, `osx-64`, `osx-arm64`
- Package page: [anaconda.org/bioconda/fastaguard](https://anaconda.org/bioconda/fastaguard)

@@ -141,12 +140,14 @@ The Docker image should stay boring:

That makes it easy to run in Nextflow, Snakemake, Galaxy, and CI systems.

-The Bioconda recipe has merged upstream. BioContainers should be confirmed
-separately by checking the generated registry image/tag once it appears. That
-path is preferable to maintaining a separate BioContainers Dockerfile unless
-automated container publication proves unavailable.
+The Bioconda recipe has merged upstream and generated a BioContainers image.
+Use the pinned tag in workflow examples:

-BioContainers image availability is still pending confirmation.
+```bash
+docker pull quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0
+```
+
+That path is preferable to maintaining a separate BioContainers Dockerfile.

## MultiQC

9 changes: 7 additions & 2 deletions docs/releases/v0.2.0.md
@@ -17,14 +17,19 @@ pipelines.

## Install

-After the v0.2.0 Bioconda update merges, install it with:
+Install the v0.2.0 Bioconda package with:

```bash
mamba install -c conda-forge -c bioconda fastaguard
```

The v0.2.0 GitHub release binaries and source archive are published. Bioconda
-may still serve v0.1.1 until the upstream recipe update is merged.
+serves v0.2.0 for Linux and macOS x86_64/ARM64 platforms. BioContainers also
+publishes the pinned workflow image:
+
+```bash
+docker pull quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0
+```

## Positioning
