docxology/MetaInformAnt

METAINFORMANT

Comprehensive bioinformatics toolkit for multi-omic analysis

Python 3.11+ | License: Apache 2.0 | Code style: black


Overview

METAINFORMANT provides bioinformatics analysis modules spanning genomics, transcriptomics, proteomics, epigenomics, and systems biology. It is built with Python 3.11+ and uses uv for fast dependency management.

At a Glance

Metric | Value
Modules | 28 specialized analysis modules
Python Files | 650+ implementation files under src/metainformant/
Plot Types | 70+ visualization methods
Documentation | 450+ project-owned README.md and AGENTS.md files

Core Capabilities

Domain | Features
DNA | Sequences, alignment, phylogenetics, population genetics, variant analysis
RNA | Amalgkit integration, ENA/SRA downloads, Kallisto quantification, industrial-scale pipelines (8,300+ samples across 28 species)
GWAS | Association testing, fine-mapping, visualization, complete GWAS pipelines
eQTL | Integration of GWAS variants and Amalgkit RNA-seq expression data
Multi-omics | Cross-omic integration, joint PCA, correlation analysis
ML | Classification, regression, feature selection, LLM integration
Visualization | Manhattan plots, heatmaps, networks, animations, publication-ready output

System Architecture

flowchart TB
    subgraph coreInfra["Core Infrastructure"]
        CORE["Core Utilities"]
    end

    subgraph molecular["Molecular Analysis"]
        DNA["DNA Analysis"]
        RNA["RNA Analysis"]
        PROT["Protein Analysis"]
        EPI["Epigenome Analysis"]
    end

    subgraph statsML["Statistical and ML"]
        GWAS["GWAS Analysis"]
        MATH["Mathematical Biology"]
        ML["Machine Learning"]
        INFO["Information Theory"]
    end

    subgraph systems["Systems Biology"]
        NET["Network Analysis"]
        MULTI["Multi-Omics Integration"]
        SC["Single-Cell Analysis"]
        SIM["Simulation"]
    end

    subgraph annotation["Annotation and Metadata"]
        ONT["Ontology"]
        PHEN["Phenotype Analysis"]
        ECO["Ecology"]
        LE["Life Events"]
    end

    subgraph utilities["Utilities"]
        QUAL["Quality Control"]
        VIZ["Visualization"]
    end

    subgraph specialized["Specialized Domains"]
        LR["Long-Read Sequencing"]
        METAG["Metagenomics"]
        SV["Structural Variants"]
        SPATIAL["Spatial Transcriptomics"]
        PHARMA["Pharmacogenomics"]
        METAB["Metabolomics"]
        MENU["Menu System"]
        CLOUD["Cloud Deployment"]
    end

    CORE --> DNA
    CORE --> RNA
    CORE --> PROT
    CORE --> EPI
    CORE --> GWAS
    CORE --> MATH
    CORE --> ML
    CORE --> INFO
    CORE --> NET
    CORE --> MULTI
    CORE --> SC
    CORE --> SIM
    CORE --> ONT
    CORE --> PHEN
    CORE --> ECO
    CORE --> LE
    CORE --> QUAL
    CORE --> VIZ
    CORE --> LR
    CORE --> METAG
    CORE --> SV
    CORE --> SPATIAL
    CORE --> PHARMA
    CORE --> METAB
    CORE --> MENU
    CORE --> CLOUD

Data Flow and Integration Architecture

graph TD
    A["Raw Biological Data"] --> B["Data Ingestion"]
    B --> C{Data Type}

    C -->|DNA| D["DNA Module"]
    C -->|RNA| E["RNA Module"]
    C -->|Protein| F["Protein Module"]
    C -->|Epigenome| G["Epigenome Module"]
    C -->|Phenotype| H["Phenotype Module"]
    C -->|Environmental| I["Ecology Module"]

    D --> J["Quality Control"]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J

    J --> K["Core Processing"]
    K --> L{Analysis Type}

    L -->|Statistical| M["GWAS Module"]
    L -->|ML| N["ML Module"]
    L -->|Information| O["Information Module"]
    L -->|Networks| P["Networks Module"]
    L -->|Systems| Q["Multi-Omics Module"]
    L -->|Singlecell| R["Single-Cell Module"]
    L -->|Simulation| S["Simulation Module"]

    M --> T["Results Integration"]
    N --> T
    O --> T
    P --> T
    Q --> T
    R --> T
    S --> T

    T --> U["Visualization"]
    U --> V["Publication Figures"]
    V --> W["Scientific Insights"]

    subgraph "Primary Data Types"
        X["Genomic"] -.-> D
        Y["Transcriptomic"] -.-> E
        Z["Proteomic"] -.-> F
        AA["Epigenetic"] -.-> G
    end

    subgraph "Analysis Workflows"
        BB["Population Genetics"] -.-> M
        CC["Feature Selection"] -.-> N
        DD["Mutual Information"] -.-> O
        EE["Community Detection"] -.-> P
        FF["Joint PCA"] -.-> Q
        GG["Trajectory Analysis"] -.-> R
    end

    subgraph "Output Formats"
        HH["Manhattan Plots"] -.-> V
        II["Heatmaps"] -.-> V
        JJ["Network Graphs"] -.-> V
        KK["Animations"] -.-> V
    end

Multi-Omic Integration Pipeline

graph TD
    A["Multi-Omic Datasets"] --> B["Sample Alignment"]
    B --> C["Batch Effect Correction"]

    C --> D{Integration Strategy}
    D -->|Early| E["Concatenated Matrix"]
    D -->|Late| F["Separate Models"]
    D -->|Intermediate| G["Meta-Analysis"]

    E --> H["Joint Dimensionality Reduction"]
    F --> I["Individual Analysis"]
    G --> J["Result Integration"]

    H --> K["Unified Clustering"]
    I --> L["Individual Clustering"]
    J --> M["Consensus Clustering"]

    K --> N["Functional Enrichment"]
    L --> N
    M --> N

    N --> O["Pathway Analysis"]
    O --> P["Network Construction"]

    P --> Q["Biological Interpretation"]
    Q --> R["Systems Biology Insights"]

    subgraph "Omic Layers"
        S["Genomics"] -.-> A
        T["Transcriptomics"] -.-> A
        U["Proteomics"] -.-> A
        V["Metabolomics"] -.-> A
        W["Epigenomics"] -.-> A
    end

    subgraph "Integration Methods"
        X["MOFA"] -.-> H
        Y["Joint PCA"] -.-> H
        Z["Similarity Networks"] -.-> H
    end

    subgraph "Biological Outputs"
        AA["Gene Modules"] -.-> Q
        BB["Regulatory Networks"] -.-> Q
        CC["Disease Pathways"] -.-> Q
        DD["Biomarkers"] -.-> Q
    end
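
The "Early" branch of the diagram (sample alignment followed by a concatenated matrix) can be illustrated with a minimal, standard-library sketch. The function name `early_integration` and the toy layer data are hypothetical and are not part of the METAINFORMANT API; batch-effect correction and feature scaling are deliberately left out.

```python
from typing import Dict, List, Tuple

Layer = Dict[str, List[float]]  # sample ID -> feature values for one omic layer


def early_integration(layers: Dict[str, Layer]) -> Tuple[List[str], List[List[float]]]:
    """Keep samples present in every layer and concatenate their features.

    Hypothetical helper for illustration only.
    """
    if not layers:
        return [], []
    # Sample alignment: intersect sample IDs across all layers.
    shared = set.intersection(*(set(layer) for layer in layers.values()))
    samples = sorted(shared)
    # Concatenate features layer by layer, in sorted layer-name order.
    matrix = [
        [value for name in sorted(layers) for value in layers[name][sample]]
        for sample in samples
    ]
    return samples, matrix


# Toy data: sample "s3" lacks proteomics, so it is dropped during alignment.
layers = {
    "transcriptomics": {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
    "proteomics": {"s1": [0.1], "s2": [0.2]},
}
samples, matrix = early_integration(layers)
print(samples)
print(matrix)
```

Each row of `matrix` holds one sample's proteomics features followed by its transcriptomics features. A joint dimensionality reduction would then operate on this matrix.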

Quality Assurance Framework

graph TD
    A["Data Processing Pipeline"] --> B["Input Validation"]
    B --> C["Type Checking"]
    C --> D["Schema Validation"]

    D --> E["Processing Logic"]
    E --> F["Error Handling"]
    F --> G["Recovery Mechanisms"]

    G --> H["Output Validation"]
    H --> I["Result Verification"]
    I --> J["Quality Metrics"]

    J --> K{Acceptable Quality?}
    K -->|Yes| L["Pipeline Success"]
    K -->|No| M["Quality Issues"]

    M --> N["Diagnostic Analysis"]
    N --> O["Error Classification"]

    O --> P{Recoverable?}
    P -->|Yes| Q["Data Correction"]
    P -->|No| R["Pipeline Failure"]

    Q --> E
    L --> S["Validated Results"]
    R --> T["Error Reporting"]

    subgraph "Validation Layers"
        U["Data Integrity"] -.-> B
        V["Business Logic"] -.-> E
        W["Statistical Validity"] -.-> H
    end

    subgraph "Quality Controls"
        X["Unit Tests"] -.-> F
        Y["Integration Tests"] -.-> I
        Z["Performance Benchmarks"] -.-> J
    end

    subgraph "Error Types"
        AA["Data Errors"] -.-> O
        BB["Logic Errors"] -.-> O
        CC["System Errors"] -.-> O
        DD["External Errors"] -.-> O
    end
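
The loop in the diagram (process, check quality, correct the data if the problem is recoverable, then process again) can be sketched as follows. This is a simplified illustration under stated assumptions: `run_with_recovery`, its parameters, and the `-999.0` missing-value sentinel are hypothetical, and error classification is reduced to a bounded number of correction attempts.

```python
from statistics import fmean
from typing import Callable, List


def run_with_recovery(
    data: List[float],
    process: Callable[[List[float]], float],
    is_acceptable: Callable[[float], bool],
    correct: Callable[[List[float]], List[float]],
    max_attempts: int = 3,
) -> float:
    """Process data, re-running after a correction step while quality is unacceptable."""
    for _ in range(max_attempts):
        result = process(data)
        if is_acceptable(result):
            return result  # corresponds to "Pipeline Success"
        data = correct(data)  # "Data Correction", then back to "Processing Logic"
    # Corresponds to "Pipeline Failure" followed by "Error Reporting".
    raise RuntimeError("quality still unacceptable after corrections")


# Toy example: -999.0 is a hypothetical missing-value sentinel.
values = [2.0, -999.0, 4.0]
result = run_with_recovery(
    values,
    process=fmean,
    is_acceptable=lambda mean: mean >= 0,
    correct=lambda xs: [x for x in xs if x >= 0],
)
print(result)
```

The first pass yields a negative mean, the correction step drops the sentinel, and the second pass succeeds. The caller's list is not modified.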

Key Features

  • Multi-Omic Analysis: DNA, RNA, protein, and epigenome data integration
  • Statistical & ML Methods: GWAS, population genetics, machine learning pipelines
  • Single-Cell Genomics: Complete scRNA-seq analysis workflows
  • Network Analysis: Biological networks, pathways, community detection algorithms
  • Visualization Suite: 14 specialized plotting modules with 70+ plot types and publication-quality output
  • Modular Architecture: Individual modules or complete end-to-end workflows
  • Comprehensive Documentation: Repo-wide README, AGENTS, SPEC, and task guides with current signposting
  • Implementation Testing: Tests exercise real implementations rather than test doubles, and unsupported features raise explicit errors
  • Quality Assurance: Rigorous validation and error handling throughout
  • Performance Optimization: Efficient algorithms for large-scale biological data

Current Validation Snapshot

As of the 2026-05-25 stabilization pass, this checkout collects 7,736 tests and the local non-network/non-external suite passes (7,495 passed, 71 skipped, 170 deselected). Root-level audit and validation reports are retained as historical snapshots; regenerate current verification outputs under output/.

Quick Start

I Want To...

Analyze DNA sequences:

# One-liner: GC content for a short sequence
uv run python - <<'PY'
from metainformant.dna.sequence.composition import gc_content

seq = "ATGCGC"
print(f"GC: {gc_content(seq) * 100:.1f}%")
PY
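
For readers who want to see the calculation without installing the package, the GC fraction can be sketched in plain Python. `gc_fraction` is a hypothetical stand-in written for this example; it is not the package's `gc_content`, and the two may treat ambiguous bases such as N differently.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases; 0.0 for an empty sequence.

    Hypothetical stand-in: every character counts toward the length,
    including ambiguous bases such as N.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# "ATGCGC" has 4 G/C bases out of 6.
print(f"GC: {gc_fraction('ATGCGC') * 100:.1f}%")
```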

Run RNA-seq pipeline (amalgkit):

# List available species configs before running an amalgkit workflow
uv run python scripts/rna/run_workflow.py --list-configs

Perform GWAS analysis:

# End-to-end Apis mellifera GWAS workflow
uv run python scripts/gwas/run_amellifera_gwas.py \
  --config config/gwas/gwas_amellifera.yaml \
  --output output/gwas/amellifera

Visualize results:

import numpy as np
from metainformant.visualization.plots.basic import heatmap

ax = heatmap(np.array([[1.0, 0.5], [0.5, 1.0]]), output_path="output/figures/heatmap.png")

Deploy to cloud (GCP):

# Inspect the GCP deployment subcommands
uv run python scripts/cloud/deploy_gcp.py --help

Choosing the Right Module

Your Data Type | Use This Module | Start Here
DNA sequences (FASTA) | dna | docs/dna/
RNA-seq (FASTQ, BAM) | rna (amalgkit) | docs/rna/
VCF + phenotypes | gwas | docs/gwas/workflow.md
Protein (FASTA, PDB) | protein | docs/protein/
Single-cell (h5ad, mtx) | singlecell | docs/singlecell/
Methylation arrays/BAMs | epigenome | docs/epigenome/
Microbiome (16S, metagenome) | metagenomics | docs/metagenomics/
Multiple omics (joint analysis) | multiomics | docs/multiomics/
Gene lists + GO terms | ontology | docs/ontology/
Phenotype traits | phenotype | docs/phenotype/
Ecological communities | ecology | docs/ecology/
Long-read (PacBio/ONT) | longread | docs/longread/
Networks & pathways | networks | docs/networks/
Information theory analysis | information | docs/information/
Simulation/synthetic data | simulation | docs/simulation/
Visualizations only | visualization | docs/visualization/
GCP cloud deployment | cloud | src/metainformant/cloud/README.md

Not sure? Read the full module matrix.


First-Time Visitor Path

  1. Install (10 min): Follow QUICKSTART.md
  2. Run demo (2 min): python3 scripts/core/run_demo.py
  3. Pick your domain: See the table above and open the documentation path listed for your module
  4. Read workflow guide: Each module's docs/<module>/workflow.md
  5. Try on sample data: Each module has tests/data/<module>/ examples
  6. Run on your data: Replace sample paths with your files

Module Signposting

The package is intentionally broad. Treat each module's source, tests, and local README/SPEC files as the source of truth for current behavior.

Area | Packages
Core and utilities | core, quality, visualization, menu, cloud
Molecular omics | dna, rna, protein, epigenome, longread, structural_variants
Higher-order omics | singlecell, spatial, multiomics, metabolomics, metagenomics, pharmacogenomics
Analysis and methods | gwas, ml, networks, simulation, math, information
Annotation and ecology | ontology, phenotype, ecology, life_events
Protocol helpers | mcp currently provides a standalone Amalgkit monitor; no MCP server is implemented

Module Overview

Complete Module Reference

All modules live in src/metainformant/ with documentation in each module's README.md.

Module | Files | Description | Key Components | Docs

Core Infrastructure
core/ | 37 | Shared utilities, I/O, logging, config, parallel processing, caching | io/, data/, execution/ | README

Molecular Analysis
dna/ | 47 | DNA sequences, alignment, phylogenetics, population genetics, variants | sequence/, alignment/, population/ | README
rna/ | 57 | RNA-seq workflows, amalgkit integration, expression quantification | amalgkit/, engine/, analysis/ | README
protein/ | 27 | Protein sequences, structure analysis, AlphaFold, UniProt integration | sequence/, structure/, database/ | README
epigenome/ | 15 | Methylation analysis, ChIP-seq, ATAC-seq, chromatin accessibility | assays/, chromatin_state/, peak_calling/ | README

Statistical & ML
gwas/ | 78 | GWAS, fine-mapping, eQTL analysis, colocalization, visualization | finemapping/, visualization/, analysis/ | README
math/ | 29 | Population genetics theory, coalescent, selection, epidemiology | population_genetics/, epidemiology/, evolutionary_dynamics/ | README
ml/ | 22 | Machine learning pipelines, classification, regression, features | models/, features/, llm/ | README
information/ | 24 | Information theory, Shannon entropy, mutual information, semantic similarity | metrics/, integration/ | README

Systems Biology
networks/ | 20 | Biological networks, graph algorithms, community detection, pathways | analysis/, interaction/ | README
multiomics/ | 12 | Multi-omic integration, joint PCA, cross-omic correlation | analysis/, methods/ | README
singlecell/ | 21 | scRNA-seq preprocessing, clustering, differential expression | data/, analysis/, visualization/ | README
simulation/ | 14 | Synthetic data, agent-based models, sequence simulation, ecosystems | models/, workflow/, benchmark/ | README

Annotation & Metadata
ontology/ | 19 | Gene Ontology, functional annotation, semantic similarity | core/, query/, visualization/ | README
phenotype/ | 30 | Phenotypic data curation, AntWiki integration, trait analysis | analysis/, data/, behavior/ | README
ecology/ | 13 | Community diversity, environmental correlations, species matrices | analysis/, phylogenetic/, visualization/ | README
life_events/ | 20 | Life course analysis, event sequences, temporal embeddings | models/, workflow/ | README

Utilities
quality/ | 10 | FASTQ quality assessment, validation, contamination detection | io/, analysis/, reporting/ | README
visualization/ | 30 | 70+ plot types, heatmaps, networks, animations, publication-ready | plots/, genomics/, analysis/ | README

Specialized Domains
longread/ | 31 | Long-read sequencing (PacBio, ONT), assembly, error correction | assembly/, quality/ | README
metagenomics/ | 18 | Metagenomic analysis, taxonomic profiling, functional annotation | amplicon/, functional/ | README
pharmacogenomics/ | 19 | Drug-gene interactions, pharmacokinetics, variant interpretation | interaction/ | README
spatial/ | 20 | Spatial transcriptomics, tissue mapping, spatial statistics | analysis/ | README
structural_variants/ | 15 | SV detection, CNV analysis, breakpoint resolution | detection/ | README
metabolomics/ | 9 | Metabolomic analysis, MS data processing, pathway mapping | analysis/ | README
cloud/ | 3 | Cloud deployment helpers, Docker/GCP workflow utilities | deployment/ | README
mcp/ | 3 | Standalone helper tools for future MCP integration | tools/ | README
menu/ | 7 | Interactive CLI menu system, workflow navigation | ui/ | README

Total: 28 package directories, 650+ Python files

Documentation

Quick Links

Transcriptomics (RNA-seq)

Module Documentation

Each module has documentation in src/metainformant/<module>/README.md and docs/<module>/.

Scripts & Workflows

The scripts/ directory contains workflow orchestrators and utilities:

  • Package Management: Setup, testing, quality control
  • RNA-seq (Amalgkit): Multi-species workflows, amalgkit integration
  • GWAS (Variants): Genome-scale association studies
  • eQTL Integration: RNA-seq + Variant cross-omics integration pipelines
  • Module Orchestrators: Complete workflow scripts for all domains (core, DNA, RNA, protein, networks, multiomics, single-cell, quality, simulation, visualization, epigenome, ecology, ontology, phenotype, ML, math, gwas, information, life_events)

See scripts/README.md for documentation.

CLI Interface

The metainformant command exposes a focused CLI (docs/cli.md): --version, --modules, protein utilities, quality checks, rna info, and gwas run. RNA workflows use Python imports, scripts/rna/run_workflow.py, or python -m metainformant.rna.amalgkit.

uv run metainformant --help
uv run metainformant --modules
uv run metainformant protein taxon-ids --file data/taxon_ids.txt
uv run metainformant protein comp --fasta data/proteins.fasta
uv run metainformant protein rmsd-ca --pdb-a data/structure1.pdb --pdb-b data/structure2.pdb
uv run metainformant quality batch-detect --data samples.csv --batches batches.txt
uv run metainformant gwas run --config config/gwas/gwas_pbarbatus.yaml --check

# RNA-seq workflow config discovery
uv run python scripts/rna/run_workflow.py --list-configs

See docs/cli.md for CLI documentation.

Usage Examples

DNA Analysis

from metainformant.dna.alignment.pairwise import global_align
from metainformant.dna.population import nucleotide_diversity

alignment = global_align("ACGTACGT", "ACGTAGGT")
print(f"Alignment score: {alignment.score}")

seqs = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
print(f"Nucleotide diversity: {nucleotide_diversity(seqs):.4f}")
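
As a point of reference for the diversity value printed above, nucleotide diversity is commonly defined as the average number of pairwise differences per site. The sketch below implements that textbook definition with the standard library; `pairwise_diversity` is a hypothetical name, and the package's `nucleotide_diversity` may differ in normalization or in how it handles gaps and ambiguous bases.

```python
from itertools import combinations
from typing import Sequence


def pairwise_diversity(seqs: Sequence[str]) -> float:
    """Average pairwise differences per site for aligned, equal-length sequences."""
    if len(seqs) < 2:
        return 0.0
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every unordered pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


seqs = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
pi = pairwise_diversity(seqs)
print(f"Pairwise diversity: {pi:.4f}")
```

For these three sequences, two of the three pairs differ at one of eight sites, so the sketch yields 2 / (3 × 8).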

RNA-seq Workflow

from metainformant.rna.engine.workflow import load_workflow_config, plan_workflow

config = load_workflow_config("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")
for step_name, _params in plan_workflow(config):
    print(step_name)

# Shell: inspect available species configs, then run a workflow after amalgkit is installed
uv run python scripts/rna/run_workflow.py --list-configs
uv run python scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

GWAS Analysis

from metainformant.gwas.analysis.association import association_test_linear

result = association_test_linear(
    genotypes=[0, 1, 2, 0, 1, 2, 0, 1],
    phenotypes=[10.1, 11.0, 12.2, 9.8, 10.9, 12.0, 10.0, 11.2],
)
print(result["beta"], result["p_value"])

# Shell: run the end-to-end Apis mellifera GWAS workflow
uv run python scripts/gwas/run_amellifera_gwas.py --config config/gwas/gwas_amellifera.yaml --output output/gwas/amellifera
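
To make the `beta` in the `association_test_linear` call above concrete, here is a minimal ordinary least-squares sketch for a single variant using only the standard library. `simple_linear_fit` is a hypothetical helper, not the package's implementation, which may additionally handle covariates, missing genotypes, and the p-value calculation omitted here.

```python
from typing import Sequence, Tuple


def simple_linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("genotypes have no variance")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


genotypes = [0, 1, 2, 0, 1, 2, 0, 1]
phenotypes = [10.1, 11.0, 12.2, 9.8, 10.9, 12.0, 10.0, 11.2]
beta, intercept = simple_linear_fit(genotypes, phenotypes)
print(f"beta={beta:.4f}, intercept={intercept:.4f}")
```

The slope is the estimated change in phenotype per additional copy of the counted allele under an additive model.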

Configuration

from metainformant.core.utils.config import apply_env_overrides, load_mapping_from_file

config = load_mapping_from_file("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")
config = apply_env_overrides(config, prefix="AK")
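
The idea of environment overrides can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `PREFIX_KEY` naming scheme, the function `apply_prefixed_overrides`, and the config keys `threads` and `work_dir` are hypothetical, and the real `apply_env_overrides` may use different variable names, nesting rules, or type coercion.

```python
import os
from typing import Any, Dict, Mapping, Optional


def apply_prefixed_overrides(
    config: Mapping[str, Any],
    prefix: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Return a copy of config where PREFIX_KEY variables replace top-level keys.

    Hypothetical scheme: key "threads" with prefix "AK" reads AK_THREADS.
    """
    env = os.environ if env is None else env
    merged = dict(config)
    for key in config:
        value = env.get(f"{prefix}_{key.upper()}")
        if value is not None:
            merged[key] = value  # values stay strings; no type coercion in this sketch
    return merged


# An explicit mapping stands in for os.environ so the example is reproducible.
overrides = apply_prefixed_overrides(
    {"threads": 4, "work_dir": "output/amalgkit"},
    prefix="AK",
    env={"AK_THREADS": "16"},
)
print(overrides)
```

Returning a new dictionary rather than mutating the input keeps the loaded file contents available for comparison or logging.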

Visualization

import numpy as np
from metainformant.visualization.plots.basic import heatmap

heatmap(np.array([[1.0, 0.2], [0.2, 1.0]]), output_path="output/figures/correlation.png")

Core Utilities

from metainformant.core import io
from metainformant.core.io import paths
from metainformant.core.utils.logging import get_logger

logger = get_logger(__name__)
resolved = paths.expand_and_resolve("output/results.json")
io.dump_json({"ok": True}, resolved)
logger.info("Wrote %s", resolved)

Project Structure

MetaInformAnt/
    src/metainformant/    # Main package
        core/             # Core utilities
        dna/              # DNA analysis
        rna/              # RNA analysis
        protein/          # Protein analysis
        gwas/             # GWAS analysis
        ...               # Additional modules
    scripts/              # Workflow scripts
        package/          # Package management
        rna/              # RNA workflows
        gwas/             # GWAS workflows
        ...               # Module scripts
    docs/                 # Documentation
    tests/                # Test suite
    config/               # Configuration files
    output/               # Analysis outputs
    data/                 # Input data

AI-Assisted Development

This project uses AI assistance to enhance:

  • Code generation and algorithm implementation
  • Comprehensive documentation
  • Test case generation
  • Architecture design

All AI-generated content undergoes human review. See AGENTS.md for details.

Known Limitations

Module Completeness

Some modules have partial implementations or optional dependencies:

  • Machine Learning: Framework exists; some methods may need completion (see ML Documentation)
  • Multi-omics: Integration methods implemented; additional dependencies may be required
  • Single-cell: Requires scipy, scanpy, anndata (see Single-Cell Documentation)
  • Network Analysis: Algorithms implemented; regulatory network features may need enhancement

GWAS Module

  • Variant Download: Database download (dbSNP, 1000 Genomes) is a placeholder; use SRA-based workflow or provide VCF files
  • Functional Annotation: Requires external tools (ANNOVAR, VEP, SnpEff) for variant annotation
  • Mixed Models: Relatedness adjustment implemented; MLM methods may require GCTA/EMMAX integration

Test Coverage

Some modules have lower test success rates due to optional dependencies:

  • Single-cell: Requires scientific dependencies (scanpy, anndata)
  • Multi-omics: Framework exists, tests may skip without dependencies
  • Network Analysis: Tests pass; features may need additional setup

See Testing Guide for detailed testing documentation and coverage information.

Best Practices

File Naming

  • Use informative names: sample_pca_biplot_colored_by_treatment.png
  • Avoid generic names: plot1.png, output.png

Output Organization

  • All outputs in output/ directory
  • Configuration saved with results
  • Visualizations in subdirectories with metadata
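
One way to follow these conventions is to write each run's results and the configuration that produced them into the same subdirectory. The sketch below is illustrative only: `save_run`, the file names `results.json` and `config.json`, and the example values are hypothetical, and a temporary directory stands in for the project's output/ directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_run(base_dir: Path, run_name: str, results: Dict[str, Any], config: Dict[str, Any]) -> Path:
    """Write results.json and the config that produced them into base_dir/run_name."""
    run_dir = base_dir / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    return run_dir


# A temporary directory stands in for the project's output/ directory here.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = save_run(Path(tmp) / "output", "gwas_demo", {"n_variants": 3}, {"maf": 0.05})
    written = sorted(p.name for p in run_dir.iterdir())
    print(written)
```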

Real Implementation Policy

  • All tests use real implementations
  • No test-double or inert placeholder methods
  • Real API calls or graceful skips
  • Ensures actual functionality

Requirements

  • Python 3.11+
  • Optional: SRA Toolkit, kallisto (for RNA workflows)
  • Optional: samtools, bcftools, bwa (for GWAS)

Contributing

See CONTRIBUTING.md for full contribution guidelines.

Contributions are welcome! Please:

  1. Follow the existing code style
  2. Add tests for new features
  3. Update documentation
  4. Use informative commit messages

Recent Improvements

Performance Enhancements

  • Intelligent Caching: Automatic caching for expensive computations (Tajima's constants, entropy calculations)
  • NumPy Vectorization: Optimized mathematical operations for 10-100x performance improvements
  • Progress Tracking: Real-time progress bars for long-running analyses
  • Memory Optimization: Efficient algorithms for large datasets
  • Resilient Orchestration: Automatic recovery flows and VM-level hard-reset protocols to recover from complete (100%) Docker overlay lockups caused by hidden fasterq-dump caches
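
The caching item above mentions Tajima's constants. As a sketch of the general idea, the harmonic sums a1 and a2 used in Tajima's D depend only on the sample size n, so they can be memoized with the standard library's `functools.lru_cache`. The function names below are hypothetical, and this README does not state which caching mechanism the package actually uses.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tajima_a1(n: int) -> float:
    """a1 = sum over i = 1..n-1 of 1/i, for sample size n."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    return sum(1.0 / i for i in range(1, n))


@lru_cache(maxsize=None)
def tajima_a2(n: int) -> float:
    """a2 = sum over i = 1..n-1 of 1/i^2, for sample size n."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    return sum(1.0 / (i * i) for i in range(1, n))


first = tajima_a1(100)   # computed on the first call (cache miss)
second = tajima_a1(100)  # returned from the cache (cache hit)
info = tajima_a1.cache_info()
print(first == second, info.hits, info.misses)
```

Memoization pays off when many genomic windows share the same sample size, because the sums are then computed once per distinct n.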

Enhanced Documentation

  • Comprehensive Tutorials: End-to-end guides for DNA, RNA, GWAS, and information theory workflows
  • Method Comparison Guides: Decision-making guides for choosing analysis algorithms
  • Extended FAQ: Troubleshooting and usage guidance for common scenarios
  • Standardized Docstrings: Consistent formatting with examples and DOI citations

Testing & Reliability

  • Expanded Test Coverage: 37+ new comprehensive tests with real implementations
  • Validation Enhancements: Improved parameter validation and error handling
  • Cross-Platform Compatibility: Python 3.14 support and external drive optimization
  • Integration Testing: Verified cross-module functionality

New Features

  • Enhanced GWAS Visualization: Complete visualization suite for population structure, effects, and comparisons
  • Information Theory Workflows: Batch processing with progress tracking
  • Protein Proteome Analysis: Taxonomy ID processing and proteome utilities
  • Advanced Error Handling: Structured error reporting with actionable guidance

Citation

If you use METAINFORMANT in your research, please cite this repository:

@software{metainformant2025,
  author = {MetaInformAnt Development Team},
  title = {MetaInformAnt: Comprehensive Bioinformatics Toolkit},
  year = {2025},
  url = {https://github.com/docxology/MetaInformAnt},
  version = {0.2.6}
}

License

This project is licensed under the Apache License, Version 2.0 - see LICENSE for details.

Contact

Acknowledgments

  • Developed with AI assistance from Cursor's Code Assistant (grok-code-fast-1)
  • Built on established bioinformatics tools and libraries
  • Community contributions and feedback

Status: Active Development | Version: 0.2.6 | Python: 3.11+ | License: Apache 2.0
