hfx-tools

Tools for working with HFX submissions (Haplotype Frequency Exchange).

This repo provides composable command-line tools and a Streamlit app for building, packing, inspecting, and validating HFX documents according to the HFX specification. Key features include:

build - Build HFX bundles from a folder structure with automatic validation
pack - Pack HFX archives from metadata.json with optional manifests and checksums
qc - Compute quality control statistics
inspect - Inspect metadata or bundled HFX files
Validation framework - Extensible validation with built-in validators
Streamlit UI - Web-based interface for building HFX files

Key schema facts

metadata.frequencyLocation controls where frequencies are stored: either "inline" or a URI such as file://data/frequencies.csv (see the HFX specification).
If inline, the JSON may include frequencyData (array of {haplotype, frequency}).

Install

Basic installation

python -m venv .venv
source .venv/bin/activate
pip install -e .

With optional dependencies

# For Parquet support
pip install -e ".[parquet]"

# For Streamlit web UI
pip install -e ".[streamlit]"

# For development
pip install -e ".[dev,lint]"

Quick Start

5-minute walkthrough for the most common workflow:

# 1. Create input folder with metadata and data
mkdir -p my_submission/{metadata,data}
cp my_metadata.json my_submission/metadata/metadata.json
cp my_frequencies.csv my_submission/data/frequencies.csv

# 2. Build and validate
hfx-build my_submission -n my_hfx_file

# 3. Done! Check output
ls -la my_submission/my_hfx_file.hfx
cat my_submission/my_hfx_file.build.log

For a guided interactive experience, launch the Streamlit web UI:

streamlit run hfx_tools/streamlit_app.py

Architecture

hfx-tools follows a layered architecture:

CLI / Streamlit UI (user-facing)
    ↓
build.py (orchestration)
    ↓
validators.py (validation rules) ← pack.py (packing logic)
    ↓
io.py (file I/O, JSON parsing)

CLI layer (cli.py) - Parses command-line arguments and delegates to build/pack/inspect/qc
Build orchestration (build.py) - High-level workflow: reads metadata → detects files → validates → packs
Validation framework (validators.py) - Pluggable validators for extensibility
Packing logic (pack.py) - Low-level archive creation (ZIP with metadata, data, optional manifest)
I/O utilities (io.py) - JSON parsing, file reading, consistent error handling

This design allows hackathon participants to:

Use the CLI for quick workflows
Call build() directly from Python for programmatic use
Register custom validators without modifying core code
Extend with custom QC statistics

Usage

Frequency Location Types

The HFX standard supports four types of frequency data locations:

Inline - "frequencyLocation": "inline" with frequencyData array in same JSON
Remote - "frequencyLocation": "https://zenodo.org/.../data.csv" or S3 URL
File (relative) - "frequencyLocation": "file://data/frequencies.csv" pointing to file within HFX bundle
File (parquet) - Same as above but with .parquet extension
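
The four location types above can be told apart from the frequencyLocation string alone. The following is a minimal sketch, not the check that hfx-tools itself performs; the function name classify_frequency_location and the returned labels are hypothetical, and treating s3:// as a remote scheme is an assumption.

```python
from urllib.parse import urlparse


def classify_frequency_location(location: str) -> str:
    """Return 'inline', 'remote', 'file', or 'file-parquet' (hypothetical labels)."""
    if location == "inline":
        return "inline"
    scheme = urlparse(location).scheme
    if scheme in ("http", "https", "s3"):  # assumption: s3:// counts as remote
        return "remote"
    if scheme == "file":
        # Test the raw string: urlparse splits file://data/x.csv into netloc and path.
        if location.lower().endswith(".parquet"):
            return "file-parquet"
        return "file"
    raise ValueError(f"Unrecognized frequencyLocation: {location!r}")
```

Values that match none of the four types raise ValueError, so a typo such as a bare relative path without file:// is reported rather than silently accepted.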

CLI: Build from folder

The most common workflow for hackathons and batch processing:

hfx-build /path/to/input_folder -n output_name

This will:

Read metadata from input_folder/metadata/
Auto-detect frequency data files in input_folder/data/
Auto-update metadata.frequencyLocation to file://data/<filename> (unless already set to remote or inline)
Validate all data with built-in validators
Pack into a single output_name.hfx file (self-contained bundle with all data)
Log all validation results to output_name.build.log
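
The auto-detect and auto-update steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in build.py: the helper name auto_update_location, the set of recognized suffixes, and the choice of the first file in sorted order are all hypothetical.

```python
from pathlib import Path

# Assumption: only these extensions are treated as frequency data files.
DATA_SUFFIXES = {".csv", ".parquet"}


def auto_update_location(metadata: dict, input_folder: Path) -> dict:
    """Return a copy of metadata whose frequencyLocation points at the detected data file."""
    updated = dict(metadata)
    current = str(metadata.get("frequencyLocation") or "")
    # Inline and remote locations are left untouched.
    if current == "inline" or current.startswith(("http://", "https://")):
        return updated
    data_dir = input_folder / "data"
    if not data_dir.is_dir():
        return updated
    candidates = sorted(
        p for p in data_dir.iterdir()
        if p.is_file() and p.suffix.lower() in DATA_SUFFIXES
    )
    if candidates:
        # Assumption: when several files exist, take the first in sorted order.
        updated["frequencyLocation"] = f"file://data/{candidates[0].name}"
    return updated
```

The function returns a copy instead of mutating its argument, which makes it easy to show a before/after preview of the metadata.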

Expected folder structure:

input_folder/
├── metadata/
│   └── metadata.json      # Required: HFX metadata + inline data (optional)
└── data/
    └── frequencies.csv    # Optional: if frequencyLocation = "file://data/frequencies.csv"

Example:

mkdir -p example/{metadata,data}
cp metadata.json example/metadata/
cp frequencies.csv example/data/
hfx-build example -n my_submission
# Output: example/my_submission.hfx

Options:

-n, --name NAME - Output filename (required, without .hfx)
-o, --out DIR - Output directory (defaults to input folder)
--no-manifest - Skip MANIFEST.json in archive
--hash {md5,sha256,none} - Hash algorithm (default: sha256)
--no-auto-update-location - Don't auto-update metadata.frequencyLocation (advanced)
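
To make the --hash option concrete, here is a minimal sketch of computing per-file checksums with Python's hashlib. The layout of MANIFEST.json is not described in this README, so the mapping returned by manifest_entries is a hypothetical shape, and both function names are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in 1 MiB chunks so large tables are never fully loaded into memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entries(root: Path, algorithm: str = "sha256") -> dict:
    """Map each file's relative POSIX path to its digest (hypothetical manifest shape)."""
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Recomputing the digests after download and comparing them with the recorded values is the usual way to confirm that a bundle arrived intact.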

CLI: Build with multiple populations

For submissions with multiple populations, use a flat naming scheme with a POPULATIONS.json manifest. This keeps the structure identical whether you have 1 or many populations:

input_folder/
├── metadata/
│   ├── pop1_metadata.json
│   ├── pop2_metadata.json
│   └── pop3_metadata.json
├── data/
│   ├── pop1_frequencies.csv
│   ├── pop2_frequencies.csv
│   └── pop3_frequencies.csv
└── POPULATIONS.json

POPULATIONS.json structure:

{
  "populations": [
    {
      "id": "pop1",
      "name": "Population 1",
      "metadataFile": "pop1_metadata.json",
      "frequencyFile": "pop1_frequencies.csv"
    },
    {
      "id": "pop2",
      "name": "Population 2",
      "metadataFile": "pop2_metadata.json",
      "frequencyFile": "pop2_frequencies.csv"
    }
  ]
}

Build command:

hfx-build input_folder -n multi_population_submission
# Generates: multi_population_submission.hfx (with POPULATIONS.json manifest)

The build process will:

Read POPULATIONS.json to discover all populations
Load each population's metadata and frequency data
Validate each population independently
Pack all into a single .hfx archive with the manifest

Note: Using the flat naming scheme (pop*_metadata.json, pop*_frequencies.csv) allows for consistent file structure whether you have a single population or many. Even single-population submissions can use this pattern for consistency.
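
Before running a multi-population build, it can help to confirm that every file named in POPULATIONS.json exists. The sketch below does this with the standard library only; missing_population_files is a hypothetical helper and not part of hfx-tools.

```python
import json
from pathlib import Path


def missing_population_files(input_folder: Path) -> list[str]:
    """List files named in POPULATIONS.json that are absent from metadata/ or data/."""
    manifest_path = input_folder / "POPULATIONS.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    missing = []
    for pop in manifest.get("populations", []):
        for key, subdir in (("metadataFile", "metadata"), ("frequencyFile", "data")):
            name = pop.get(key)
            if name and not (input_folder / subdir / name).is_file():
                missing.append(f"{pop.get('id', '?')}: {subdir}/{name}")
    return missing
```

An empty list means every referenced file was found; otherwise each entry names the population id and the missing path.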

CLI: Pack (low-level)

For direct packing when you already have a metadata.json:

hfx-pack metadata.json -o dist/example.hfx --manifest --hash sha256

CLI: Inspect

hfx-inspect metadata.json       # Inspect a metadata.json file
hfx-inspect example.hfx         # Inspect a bundled .hfx archive
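
Since the Architecture section describes the archive as a ZIP, a bundle's contents can also be listed from Python with the standard zipfile module. This is a small sketch under that assumption; list_bundle is a hypothetical name and does not reproduce what hfx-inspect prints.

```python
import zipfile


def list_bundle(path: str) -> list[str]:
    """Return the sorted member names of an .hfx archive, assuming it is a ZIP file."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())
```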

CLI: QC

hfx-qc metadata.json --write-metadata --topk 10 100 1000
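
This README does not define what --topk computes. One plausible reading, stated here purely as an assumption, is the cumulative frequency covered by the k most frequent haplotypes. The sketch below illustrates that statistic; topk_cumulative is a hypothetical function and may differ from what qc.py does.

```python
def topk_cumulative(frequencies: list[float], ks: list[int]) -> dict[int, float]:
    """For each k, sum the k largest frequencies (assumed meaning of a top-k statistic)."""
    ranked = sorted(frequencies, reverse=True)
    # Slicing past the end is safe: a k larger than the list simply sums every entry.
    return {k: sum(ranked[:k]) for k in ks}
```

A statistic of this kind shows how concentrated a distribution is: a high top-10 value means a few haplotypes account for most of the frequency mass.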

Streamlit: Web UI

Launch the interactive web interface:

streamlit run hfx_tools/streamlit_app.py

The Streamlit app provides:

Folder browser - Select local folders with metadata/ and data/ subdirectories
File upload - Upload metadata.json and data files directly
Auto-update mode - Automatically sets metadata.frequencyLocation to point to uploaded data
Metadata preview - View JSON structure and what will be auto-updated before building
Validation preview - Run validators and see results
HFX download - Download the built .hfx file
Build logs - View detailed validation and packing logs

Validation Framework

The build process includes an extensible validation framework with built-in validators:

Metadata required fields - Ensures metadata.frequencyLocation is present
Frequency location - Validates frequency location format (inline, file://, http:// or https://)
Frequency data format - Checks inline frequency data structure, types, and duplicates
File references - Verifies that referenced data files exist

Validation results are logged and returned with error/warning levels. The build fails if any error-level validations fail.
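
As an illustration of the "Frequency data format" check, the sketch below inspects an inline frequencyData array for structure, types, and duplicates. It is not the built-in validator: the function name, the messages, and the requirement that each frequency lie in [0, 1] are assumptions.

```python
def check_inline_frequencies(frequency_data) -> list[str]:
    """Return a list of problems found in an inline frequencyData array."""
    if not isinstance(frequency_data, list):
        return ["frequencyData must be an array"]
    problems = []
    seen = set()
    for index, row in enumerate(frequency_data):
        if not isinstance(row, dict):
            problems.append(f"row {index}: expected an object")
            continue
        haplotype = row.get("haplotype")
        frequency = row.get("frequency")
        if not isinstance(haplotype, str) or not haplotype:
            problems.append(f"row {index}: haplotype must be a non-empty string")
        elif haplotype in seen:
            problems.append(f"row {index}: duplicate haplotype {haplotype}")
        else:
            seen.add(haplotype)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(frequency, bool) or not isinstance(frequency, (int, float)):
            problems.append(f"row {index}: frequency must be a number")
        elif not 0 <= frequency <= 1:  # assumption: frequencies are proportions
            problems.append(f"row {index}: frequency {frequency} outside [0, 1]")
    return problems
```

Collecting every problem instead of stopping at the first one matches the logging behavior described above, where all validation results end up in the build log.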

Custom validators (for hackathon extensibility)

Hackathon participants can register custom validators:

from hfx_tools.validators import ValidationFramework, ValidationResult

def my_custom_validator(metadata_json, hfx_obj, data_folder):
    # Your validation logic here
    return ValidationResult(
        validator_name="my_validator",
        passed=True,
        message="My validation passed",
        level="info"  # or "warning", "error"
    )

validator_framework = ValidationFramework()
validator_framework.register_validator("my_validator", my_custom_validator)

Common Use Cases

Scenario 1: Batch submission from local folder

Build and submit multiple HFX files from organized folders:

for dir in submissions/*/; do
  hfx-build "$dir" -n "$(basename "$dir")" -o dist/
done

Scenario 2: Remote frequency data

Point to frequencies hosted on Zenodo or S3 without bundling:

{
  "frequencyLocation": "https://zenodo.org/record/12345/files/data.csv",
  ...
}

Build skips file detection and includes only the metadata:

hfx-build my_submission -n my_file --no-auto-update-location

Scenario 3: Inline small frequencies

For small datasets, embed frequencies directly in JSON:

{
  "frequencyLocation": "inline",
  "frequencyData": [
    {"haplotype": "A*01:01", "frequency": 0.123},
    {"haplotype": "A*01:02", "frequency": 0.456}
  ]
}
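
If the small dataset starts life as a CSV file, it can be converted to the inline form with a few lines of Python. This sketch assumes the CSV has header columns named haplotype and frequency, which this README does not specify; csv_to_inline is a hypothetical helper.

```python
import csv
import io


def csv_to_inline(csv_text: str) -> dict:
    """Build the inline metadata fields from CSV text with haplotype,frequency columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {"haplotype": row["haplotype"], "frequency": float(row["frequency"])}
        for row in reader
    ]
    return {"frequencyLocation": "inline", "frequencyData": rows}
```

The returned dictionary can be merged into the rest of the metadata with dict.update() before writing metadata.json.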

Scenario 4: Multiple populations in one submission

Combine data from multiple populations into a single HFX file:

# Folder structure
submission/
├── metadata/
│   ├── european_metadata.json
│   ├── asian_metadata.json
│   └── african_metadata.json
├── data/
│   ├── european_frequencies.csv
│   ├── asian_frequencies.csv
│   └── african_frequencies.csv
└── POPULATIONS.json

POPULATIONS.json:

{
  "populations": [
    {
      "id": "european",
      "name": "European Population",
      "metadataFile": "european_metadata.json",
      "frequencyFile": "european_frequencies.csv"
    },
    {
      "id": "asian",
      "name": "Asian Population",
      "metadataFile": "asian_metadata.json",
      "frequencyFile": "asian_frequencies.csv"
    },
    {
      "id": "african",
      "name": "African Population",
      "metadataFile": "african_metadata.json",
      "frequencyFile": "african_frequencies.csv"
    }
  ]
}

Build:

hfx-build submission -n global_study
# Output: global_study.hfx (contains all 3 populations)

Scenario 5: Programmatic use in Python

from hfx_tools.build import build

result = build(
    input_folder="my_data/",
    output_name="my_submission",
    output_dir="dist/",
    hash_algorithm="sha256",
    include_manifest=True
)
print(f"Build {'succeeded' if result.success else 'failed'}")
for validation in result.validations:
    print(f"  {validation.level}: {validation.message}")

Developer API

Using the Validation Framework

from hfx_tools.validators import ValidationFramework, ValidationResult

# Create framework
validator = ValidationFramework()

# Add custom validation
def check_population_size(metadata_json, hfx_obj, data_folder):
    pop_size = metadata_json.get("populationSize", 0)
    if pop_size < 100:
        return ValidationResult(
            validator_name="population_size",
            passed=False,
            message=f"Population too small: {pop_size} < 100",
            level="warning"
        )
    return ValidationResult(
        validator_name="population_size",
        passed=True,
        message=f"Population size OK: {pop_size}",
        level="info"
    )

validator.register_validator("population_size", check_population_size)

# Run validations (metadata, hfx, and data_folder must already be loaded)
results = validator.validate(metadata, hfx, data_folder)

Building programmatically

from hfx_tools.build import build
from hfx_tools.io import read_metadata_json

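import json

# Load and modify metadata before building
metadata = read_metadata_json("metadata/metadata.json")
metadata["submissionNotes"] = "Added via script"

# build() reads metadata from the input folder, so write the change back first
with open("metadata/metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)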

# Build with custom settings
result = build(
    input_folder=".",
    output_name="my_hfx",
    hash_algorithm="sha256",
    include_manifest=True
)

if not result.success:
    print("Validation errors:")
    for v in result.validations:
        if v.level == "error":
            print(f"  - {v.message}")

Package contents

.
├── __init__.py
├── build.py           # Build orchestration (NEW - hackathon MVP)
├── cli.py             # Command-line interface
├── inspect.py         # HFX inspection tools
├── io.py              # JSON and file I/O
├── pack.py            # Low-level packing
├── qc.py              # Quality control
├── streamlit_app.py   # Web UI (NEW - hackathon MVP)
├── util.py            # Utilities
├── validators.py      # Validation framework (NEW - hackathon MVP)
├── Makefile
└── pyproject.toml

Development & Contributing

Development setup

# Clone and install with dev dependencies
git clone https://github.com/nmdp-bioinformatics/hfx-tools
cd hfx-tools
make sync EXTRAS="dev,lint"

Running tests and linting

make fmt       # Format code
make lint      # Check code style
make test      # Run test suite
make build     # Build distribution

Project structure for contributors

validators.py - Add new validators here (see ValidationResult class)
build.py - Core build logic, add workflow features here
cli.py - Command-line entry points, add new commands here
streamlit_app.py - Web UI, add interactive features here

Submitting changes

Fork the repo
Create a feature branch (git checkout -b feature/my-feature)
Add tests for new functionality
Run make lint test to verify
Submit a pull request

Troubleshooting

Issue: "frequencyLocation not found"

Cause: Metadata doesn't include the frequencyLocation field.

Solution: Add to metadata.json:

{
  "frequencyLocation": "file://data/frequencies.csv"
}

Or use inline frequencies if no external file:

{
  "frequencyLocation": "inline",
  "frequencyData": [...]
}

Issue: Validation errors but can't see why

Solution: Check the build log:

hfx-build my_data -n output
cat my_data/output.build.log    # Detailed validation results

Issue: File not found in bundle

Cause: The data file exists, but metadata.frequencyLocation points to the wrong path.

Solution: Ensure the relative path matches the folder structure:

my_data/
├── metadata/
│   └── metadata.json    # with frequencyLocation: "file://data/my_file.csv"
└── data/
    └── my_file.csv      # ← matches the path
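
To check a path by hand, the file:// value can be resolved against the submission folder as in the sketch below. The helper resolve_file_location is hypothetical, and rejecting absolute paths and ".." segments is an assumption about what a self-contained bundle should allow.

```python
from pathlib import Path


def resolve_file_location(bundle_root: Path, location: str) -> Path:
    """Turn a file:// frequencyLocation into a path under bundle_root."""
    prefix = "file://"
    if not location.startswith(prefix):
        raise ValueError(f"Not a file:// location: {location!r}")
    relative = Path(location[len(prefix):])
    # Assumption: bundled data must stay inside the submission folder.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"Location must be relative to the bundle: {location!r}")
    return bundle_root / relative
```

For the layout above, resolve_file_location(Path("my_data"), "file://data/my_file.csv").is_file() should be true when the metadata and the folder agree.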

Issue: Permission denied when creating .venv

Solution: Make sure you have write permission to the directory, or create the virtual environment in a writable location:

mkdir -p ~/.hfx-tools
make sync VENV=~/.hfx-tools/.venv

Issue: POPULATIONS.json not recognized

Cause: File is missing, malformed, or in the wrong location.

Solution: Ensure POPULATIONS.json is at the root of the input folder (same level as metadata/ and data/):

input_folder/
├── POPULATIONS.json         # ← Must be here
├── metadata/
└── data/

Validate JSON syntax:

python -m json.tool input_folder/POPULATIONS.json

Issue: Population metadata mismatch

Cause: POPULATIONS.json references files that don't exist in metadata/ or data/.

Solution: Check that filenames match exactly:

{
  "populations": [
    {
      "id": "pop1",
      "metadataFile": "pop1_metadata.json",    # ← Must exist in metadata/
      "frequencyFile": "pop1_frequencies.csv"  # ← Must exist in data/
    }
  ]
}

Resources

HFX Specification - Authoritative format specification and schema
phycus - Related NMDP bioinformatics tools
Issues & Discussions - Report bugs or suggest features
HFX Spec Issues - Discuss spec-related questions
