8 changes: 7 additions & 1 deletion .github/workflows/ci.yml
@@ -1,5 +1,8 @@
name: CI

env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true

on:
push:
branches:
@@ -13,7 +16,7 @@ jobs:

steps:
- name: Checkout
- uses: actions/checkout@v4
+ uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -32,6 +35,9 @@ jobs:
- name: Run clippy
run: cargo clippy --locked --all-targets --all-features -- -D warnings

- name: Check adoption assets
run: python3 -m unittest tests.python.test_adoption_assets -v

- name: Build Docker image
run: docker build -t fastaguard:ci .

9 changes: 6 additions & 3 deletions .github/workflows/release.yml
@@ -1,5 +1,8 @@
name: Release

env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true

on:
push:
tags:
@@ -25,7 +28,7 @@ jobs:

steps:
- name: Checkout
- uses: actions/checkout@v4
+ uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -45,7 +48,7 @@ jobs:
run: scripts/package_release_artifact.sh "${{ matrix.target }}" "${GITHUB_REF_NAME}"

- name: Upload artifact
- uses: actions/upload-artifact@v4
+ uses: actions/upload-artifact@v5
with:
name: fastaguard-${{ github.ref_name }}-${{ matrix.target }}
path: dist/*.tar.gz
@@ -57,7 +60,7 @@ jobs:
needs: build
steps:
- name: Checkout
- uses: actions/checkout@v4
+ uses: actions/checkout@v5

- name: Download artifacts
uses: actions/download-artifact@v5
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Ehsan Estaji

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 2 additions & 0 deletions README.md
@@ -135,6 +135,8 @@ FastaGuard catches FASTA-level assembly problems before expensive assembly QC.
- [Product thesis](docs/product-thesis.md)
- [MVP spec](docs/mvp-spec.md)
- [Output contract](docs/output-contract.md)
- [Tool landscape](docs/tool-landscape.md)
- [Adoption plan](docs/adoption-plan.md)
- [LLM and tooling vision](docs/llm-tooling-vision.md)
- [Benchmarking](docs/benchmarking.md)
- [Packaging](docs/packaging.md)
72 changes: 72 additions & 0 deletions docs/adoption-plan.md
@@ -0,0 +1,72 @@
# Adoption Plan

## Recommendation

The next product phase should focus on installability and pipeline trust before
adding many new biological heuristics.

Priority:

```text
Bioconda -> BioContainers -> MultiQC plugin -> public benchmarks -> upstream workflow examples
```

## Phase 1: Package

Goal: make installation natural for bioinformatics users.

- Keep GitHub release binaries working.
- Keep Docker smoke tests passing.
- Replace the Bioconda recipe placeholder SHA256 once a public source archive exists.
- Submit `packaging/bioconda/` as `recipes/fastaguard/` to Bioconda.
- Let BioContainers build from the merged Bioconda recipe.

Done when:

```bash
conda install -c bioconda fastaguard
fastaguard --schema
```

works in a clean environment.
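
The "done when" check could be automated as a smoke test. The following is a minimal sketch, assuming `fastaguard --schema` prints a JSON Schema document to stdout and exits with status 0; this plan names the command but not its output format. The helper names `parse_schema` and `schema_from_cli` are hypothetical.

```python
import json
import subprocess


def parse_schema(text: str) -> dict:
    """Parse schema text and confirm the top level is a JSON object."""
    schema = json.loads(text)
    if not isinstance(schema, dict):
        raise ValueError("expected a JSON object at the top level")
    return schema


def schema_from_cli(binary: str = "fastaguard") -> dict:
    """Run `<binary> --schema` and parse its stdout.

    Assumption: the command writes JSON to stdout and exits 0.
    check=True raises CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        [binary, "--schema"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_schema(result.stdout)
```

Calling `schema_from_cli()` inside a freshly created conda environment would then fail loudly if the binary is missing from `PATH` or emits something other than a JSON object.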

## Phase 2: Aggregate

Goal: make FastaGuard visible in standard pipeline reports.

- Continue emitting `fastaguard_mqc.json` custom content.
- Develop `integrations/multiqc/` into a packaged MultiQC plugin.
- Test the plugin against multiple sample reports.
- Decide whether to submit upstream to MultiQC once public adoption begins.

Done when:

```bash
multiqc .
```

shows FastaGuard verdicts and key metrics across many samples.

## Phase 3: Prove

Goal: show why FastaGuard is worth adding before expensive tools.

- Benchmark public FASTA files.
- Capture examples of duplicate IDs, invalid symbols, high-N scaffolds, and suspicious composition.
- Document which findings should block downstream tools and which should only recommend deeper QC.
- Create a concise comparison against `seqkit stats`, QUAST, BUSCO, BlobToolKit, FastQC, and MultiQC.

Done when the README can show real examples rather than only promises.

## Phase 4: Expand

Goal: add profiles once the assembly preflight contract is trusted.

- transcriptome profile
- protein profile
- reference-panel profile
- compare mode for many FASTA files
- richer anomaly evidence
- LLM/tool-agent affordances on top of stable JSON and finding catalogs

Avoid expanding profiles before packaging and benchmarks are credible.
22 changes: 22 additions & 0 deletions docs/benchmarking.md
@@ -80,3 +80,25 @@ Use it to answer:
- does the tool still behave well on large record counts?

Do not use it to claim performance on contaminated assemblies, highly ambiguous assemblies, or compressed FASTA until separate fixtures cover those cases.

## Evidence To Collect Next

Use release binaries and public assemblies to build a small evidence table for the README and release notes:

- bacterial assembly around 5 Mbp
- fungal or small eukaryotic assembly around 30-50 Mbp
- large fragmented assembly with many contigs
- gzipped FASTA input
- intentionally problematic FASTA fixture with duplicate IDs and high-N scaffolds

For each run, record:

- FastaGuard version
- platform
- input size and sequence count
- elapsed seconds
- peak memory if measured externally
- verdict and top findings
- whether downstream tools would have been blocked or recommended

This evidence matters more than synthetic speed alone because it shows the wedge: cheap FASTA preflight before expensive downstream QC.
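
One possible way to keep those per-run records consistent is a small record type that renders straight into a Markdown table. This is only a sketch: the class name, field names, and column labels are illustrative and not part of the FastaGuard output contract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchmarkRun:
    """One row of the evidence table; field names are illustrative."""

    fastaguard_version: str
    platform: str
    input_bytes: int
    sequence_count: int
    elapsed_seconds: float
    verdict: str
    top_findings: List[str] = field(default_factory=list)
    downstream_action: str = "recommended"  # or "blocked"
    peak_memory_mb: Optional[float] = None  # only if measured externally


COLUMNS = [
    "Version", "Platform", "Input bytes", "Sequences", "Seconds",
    "Peak MB", "Verdict", "Top findings", "Downstream",
]


def to_markdown(runs: List[BenchmarkRun]) -> str:
    """Render runs as a Markdown table: header, separator, one row per run."""
    rows = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for run in runs:
        # Peak memory is optional, so show a placeholder when it was not measured.
        peak = "n/a" if run.peak_memory_mb is None else f"{run.peak_memory_mb:.0f}"
        cells = [
            run.fastaguard_version,
            run.platform,
            str(run.input_bytes),
            str(run.sequence_count),
            f"{run.elapsed_seconds:.2f}",
            peak,
            run.verdict,
            "; ".join(run.top_findings) or "none",
            run.downstream_action,
        ]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

Keeping peak memory optional mirrors the "if measured externally" caveat above, so a missing measurement is recorded as such rather than guessed.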
39 changes: 38 additions & 1 deletion docs/packaging.md
@@ -69,7 +69,13 @@ For the first public release:

## Bioconda

- Bioconda should be added after the first public source archive is available. The recipe should expose one executable:
+ Bioconda should be submitted after the first public source archive is available. A starter recipe now lives in:

```text
packaging/bioconda/
```

The recipe should expose one executable:

```text
fastaguard
@@ -91,6 +97,14 @@ Do not block the MVP on Bioconda, but design for it now:
- maintain stable exit codes
- maintain a versioned JSON Schema

Important current blocker: the GitHub repository is private. Before upstream Bioconda submission, make the source archive public or move the final recipe to a public source URL and replace the placeholder SHA256 in `packaging/bioconda/meta.yaml`.
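
Once a public source archive exists, the replacement digest can be computed locally before editing the recipe. A minimal sketch using only the Python standard library; the function name is hypothetical and the archive path is whatever file was downloaded from the final public source URL.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hex string is what would replace the placeholder in `packaging/bioconda/meta.yaml`. Reading in chunks keeps memory use flat regardless of archive size.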

Bioconda recipe guidance checked for this setup:

- Bioconda hosts bioinformatics-specific packages.
- Rust dependencies should have license metadata bundled, so the starter recipe uses `cargo-bundle-licenses`.
- Tests in `meta.yaml` must rely only on runtime dependencies, so the starter tests use FastaGuard contract discovery commands.

## Container Strategy

The Docker image should stay boring:
@@ -101,3 +115,26 @@ The Docker image should stay boring:
- one entrypoint: `fastaguard`

That makes it easy to run in Nextflow, Snakemake, Galaxy, and CI systems.

Once the Bioconda recipe is merged upstream, BioContainers can build the corresponding container from the conda recipe. That path is preferable to maintaining a separate BioContainers Dockerfile unless Bioconda packaging proves impossible.

## MultiQC

FastaGuard v0.1.0 emits MultiQC custom content as `fastaguard_mqc.json`.

A native MultiQC plugin starter now lives in:

```text
integrations/multiqc/
```

Local development:

```bash
cd integrations/multiqc
python -m pip install -e .
cd ../../examples/reports
multiqc .
```

This is intentionally compact: it parses `fastaguard_mqc.json`, adds key metrics to MultiQC general stats, and adds a FastaGuard summary section. The full evidence remains in FastaGuard's own HTML and JSON reports.
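
The parsing step can be sketched roughly as follows. This assumes `fastaguard_mqc.json` follows the usual MultiQC custom-content layout, with a top-level `data` object keyed by sample name; the actual FastaGuard field names are not shown in this document, and `load_sample_metrics` is a hypothetical helper, not the plugin's real parser.

```python
import json
from pathlib import Path
from typing import Dict


def load_sample_metrics(path: Path) -> Dict[str, dict]:
    """Return {sample_name: metrics} from a custom-content JSON file.

    Assumption: the file has a top-level "data" object whose values are
    per-sample metric dictionaries. Entries that are not dicts are skipped.
    """
    payload = json.loads(path.read_text(encoding="utf-8"))
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, dict):
        raise ValueError(f"{path}: missing top-level 'data' object")
    return {
        sample: dict(metrics)
        for sample, metrics in data.items()
        if isinstance(metrics, dict)
    }
```

A module built on this would pass the resulting per-sample dictionaries to MultiQC's general stats and table helpers; that wiring is left out here because it depends on the MultiQC version in use.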
83 changes: 83 additions & 0 deletions docs/tool-landscape.md
@@ -0,0 +1,83 @@
# Tool Landscape

## Positioning

FastaGuard should not compete with established downstream tools. It should make
their inputs safer and easier to triage.

Recommended slogan:

```text
Run FastaGuard first.
```

Long-form positioning:

```text
The FASTA preflight QC layer for modern bioinformatics pipelines.
```

## Where FastaGuard Fits

| Tool | Primary role | When it runs | What FastaGuard adds before it |
| --- | --- | --- | --- |
| FastQC | Raw read QC | Before assembly or mapping | FastaGuard targets FASTA assemblies/references, not read files |
| seqkit | General sequence toolkit | Any ad hoc sequence operation | FastaGuard turns common FASTA checks into one opinionated QC contract |
| QUAST | Assembly quality evaluation | After assembly | FastaGuard catches structural FASTA problems before assembly QC |
| BUSCO | Completeness assessment | After assembly/transcriptome/protein prediction | FastaGuard checks parseability and composition before biological completeness |
| BlobToolKit | Contamination/cobiont exploration | After assembly and supporting evidence | FastaGuard flags FASTA-level anomalies before taxonomy workflows |
| MultiQC | Report aggregation | End of pipelines | FastaGuard emits data MultiQC can aggregate |
| Custom scripts | Pipeline-specific checks | Anywhere | FastaGuard replaces fragile repeated scripts with a versioned schema |

## The Gap

Without FastaGuard, users typically combine several partial checks:

- run `seqkit stats` for counts and lengths
- run custom scripts for duplicate IDs or invalid symbols
- run QUAST for assembly metrics
- run BUSCO for biological completeness
- run BlobToolKit or taxonomy tooling for contamination exploration
- rely on pipeline-specific assumptions for exit codes and report parsing

That works, but it is fragmented. The missing layer is a default, explainable,
machine-readable FASTA preflight contract.

## Product Evidence We Have

Current v0.1.0 evidence:

- Rust CLI builds and runs as a single binary.
- Docker build and smoke test pass.
- GitHub release workflow builds Linux and macOS binaries.
- JSON Schema validates committed golden reports.
- Reports include bounded evidence records and suggested actions.
- MultiQC custom-content JSON is emitted as `fastaguard_mqc.json`.
- A native MultiQC plugin starter exists under `integrations/multiqc/`.
- Bioconda recipe scaffolding exists under `packaging/bioconda/`.
- nf-core, Nextflow, and Snakemake starters exist under `examples/`.

Evidence still needed:

- benchmarks on public assemblies
- user feedback from real pipeline authors
- Bioconda/BioContainers availability
- official MultiQC module or packaged plugin
- comparison examples showing what FastaGuard catches before QUAST/BUSCO/BlobToolKit

## Message Discipline

Say:

```text
FastaGuard catches FASTA-level problems before expensive downstream QC.
```

Do not say:

```text
FastQC for FASTA.
```

That phrase is tempting, but it hides the more important product idea:
FastaGuard is a pipeline-native preflight contract, not just a report.
29 changes: 29 additions & 0 deletions integrations/multiqc/README.md
@@ -0,0 +1,29 @@
# MultiQC FastaGuard Module Starter

This directory contains a dedicated MultiQC plugin starter for FastaGuard.

FastaGuard already emits MultiQC custom-content JSON as `fastaguard_mqc.json`.
This plugin is the next step: a native module that can add FastaGuard verdicts
and key assembly preflight metrics directly to MultiQC reports.

## Local Install

From this directory:

```bash
python -m pip install -e .
cd path/to/fastaguard/results
multiqc .
```

The plugin looks for `*fastaguard_mqc.json` files and reads the same custom
content contract emitted by the CLI.
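
To preview which files that pattern would match in a results directory, a standalone sketch such as the one below may help. It only illustrates the `*fastaguard_mqc.json` glob; the plugin itself presumably relies on MultiQC's own file search rather than this hypothetical `find_reports` helper.

```python
from pathlib import Path
from typing import List, Union


def find_reports(
    root: Union[str, Path], pattern: str = "*fastaguard_mqc.json"
) -> List[Path]:
    """Recursively list files under root whose names match the pattern."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

Because the pattern starts with `*`, prefixed names such as `sampleA_fastaguard_mqc.json` match as well as the bare `fastaguard_mqc.json`.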

## Current Scope

- Parse FastaGuard custom-content JSON.
- Add verdict and summary metrics to the MultiQC general stats table.
- Add one FastaGuard summary table section.

Keep the module compact. MultiQC should summarize many FastaGuard reports, not
replicate every field from the full FastaGuard HTML report.
20 changes: 20 additions & 0 deletions integrations/multiqc/pyproject.toml
@@ -0,0 +1,20 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "multiqc-fastaguard"
version = "0.1.0"
description = "MultiQC module for FastaGuard FASTA preflight reports"
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
dependencies = [
"multiqc>=1.28",
]

[project.entry-points."multiqc.modules.v1"]
fastaguard = "fastaguard_multiqc:MultiqcModule"

[tool.hatch.build.targets.wheel]
packages = ["src/fastaguard_multiqc"]
13 changes: 13 additions & 0 deletions integrations/multiqc/src/fastaguard_multiqc/__init__.py
@@ -0,0 +1,13 @@
"""MultiQC plugin starter for FastaGuard."""

from .parser import load_custom_content_summary

__all__ = ["MultiqcModule", "load_custom_content_summary"]


def __getattr__(name):
if name == "MultiqcModule":
from .multiqc_module import MultiqcModule

return MultiqcModule
raise AttributeError(name)