Tools for working with HFX submissions (Haplotype Frequency Exchange).
This repo provides composable command-line tools and a Streamlit app for building, packing, inspecting, and validating HFX documents, implementing the HFX specification. Key features include:
- `build` - Build HFX bundles from a folder structure with automatic validation
- `pack` - Pack HFX archives from metadata.json with optional manifests and checksums
- `qc` - Compute quality control statistics
- `inspect` - Inspect metadata or bundled HFX files
- Validation framework - Extensible validation with built-in validators
- Streamlit UI - Web-based interface for building HFX files
- `metadata.frequencyLocation` controls where frequencies are stored: either `"inline"` or a URI (e.g., `file://data/frequencies.csv`) (see HFX specification).
- If inline, the JSON may include `frequencyData` (array of `{haplotype, frequency}`).
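A quick sanity check of inline data needs nothing beyond the standard library. The sketch below is illustrative, not hfx-tools API: the helper name `check_inline_frequencies` and the sum tolerance are assumptions; it checks the `{haplotype, frequency}` shape, value ranges, and duplicates:

```python
import json

def check_inline_frequencies(metadata: dict, tol: float = 1e-6) -> list:
    """Return a list of problems found in inline frequencyData (empty = OK)."""
    problems = []
    if metadata.get("frequencyLocation") != "inline":
        return ["frequencyLocation is not 'inline'"]
    rows = metadata.get("frequencyData", [])
    seen = set()
    total = 0.0
    for i, row in enumerate(rows):
        hap, freq = row.get("haplotype"), row.get("frequency")
        if not isinstance(hap, str):
            problems.append(f"row {i}: missing or non-string haplotype")
        elif hap in seen:
            problems.append(f"row {i}: duplicate haplotype {hap!r}")
        else:
            seen.add(hap)
        if not isinstance(freq, (int, float)) or not 0.0 <= freq <= 1.0:
            problems.append(f"row {i}: frequency must be a number in [0, 1]")
        else:
            total += freq
    if rows and total > 1.0 + tol:
        problems.append(f"frequencies sum to {total:.6f} > 1")
    return problems

metadata = json.loads("""{
  "frequencyLocation": "inline",
  "frequencyData": [
    {"haplotype": "A*01:01", "frequency": 0.123},
    {"haplotype": "A*01:02", "frequency": 0.456}
  ]
}""")
print(check_inline_frequencies(metadata))  # → []
```

The built-in validators described later perform checks of this kind during `hfx-build`.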
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .

# For Parquet support
pip install -e ".[parquet]"

# For Streamlit web UI
pip install -e ".[streamlit]"

# For development
pip install -e ".[dev,lint]"
```

5-minute walkthrough for the most common workflow:
```bash
# 1. Create input folder with metadata and data
mkdir -p my_submission/{metadata,data}
cp my_metadata.json my_submission/metadata/metadata.json
cp my_frequencies.csv my_submission/data/frequencies.csv

# 2. Build and validate
hfx-build my_submission -n my_hfx_file

# 3. Done! Check output
ls -la my_submission/my_hfx_file.hfx
cat my_submission/my_hfx_file.build.log
```

For a guided interactive experience, launch the Streamlit web UI:

```bash
streamlit run hfx_tools/streamlit_app.py
```

hfx-tools follows a layered architecture:
```
CLI / Streamlit UI (user-facing)
        ↓
  build.py (orchestration)
        ↓
validators.py (validation rules) ← pack.py (packing logic)
        ↓
  io.py (file I/O, JSON parsing)
```
- CLI layer (`cli.py`) - Parses command-line arguments and delegates to build/pack/inspect/qc
- Build orchestration (`build.py`) - High-level workflow: reads metadata → detects files → validates → packs
- Validation framework (`validators.py`) - Pluggable validators for extensibility
- Packing logic (`pack.py`) - Low-level archive creation (ZIP with metadata, data, optional manifest)
- I/O utilities (`io.py`) - JSON parsing, file reading, consistent error handling
This design allows hackathon participants to:
- Use the CLI for quick workflows
- Call `build()` directly from Python for programmatic use
- Register custom validators without modifying core code
- Extend with custom QC statistics
The HFX standard supports four types of frequency data locations:
- Inline - `"frequencyLocation": "inline"` with a `frequencyData` array in the same JSON
- Remote - `"frequencyLocation": "https://zenodo.org/.../data.csv"` or an S3 URL
- File (relative) - `"frequencyLocation": "file://data/frequencies.csv"` pointing to a file within the HFX bundle
- File (parquet) - Same as above but with a `.parquet` extension
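The four location types can be distinguished with a small dispatcher. The sketch below is for illustration only; the function and category names are invented, and the spec itself only defines the URI forms:

```python
def classify_frequency_location(location: str) -> str:
    """Map a frequencyLocation value to one of the four HFX location types."""
    if location == "inline":
        return "inline"
    if location.startswith(("http://", "https://", "s3://")):
        return "remote"
    if location.startswith("file://"):
        relative_path = location[len("file://"):]
        return "file-parquet" if relative_path.endswith(".parquet") else "file-relative"
    raise ValueError(f"unrecognized frequencyLocation: {location!r}")

print(classify_frequency_location("inline"))                           # inline
print(classify_frequency_location("file://data/frequencies.csv"))      # file-relative
print(classify_frequency_location("file://data/frequencies.parquet"))  # file-parquet
print(classify_frequency_location("https://zenodo.org/record/12345/files/data.csv"))  # remote
```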
The most common workflow for hackathons and batch processing:
```bash
hfx-build /path/to/input_folder -n output_name
```

This will:
- Read metadata from `input_folder/metadata/`
- Auto-detect frequency data files in `input_folder/data/`
- Auto-update `metadata.frequencyLocation` to `file://data/<filename>` (unless already set to remote or inline)
- Validate all data with built-in validators
- Pack into a single `output_name.hfx` file (self-contained bundle with all data)
- Log all validation results to `output_name.build.log`
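The auto-update step can be pictured in a few lines of Python. This is a sketch of the behavior described above, not the actual `build.py` implementation; the helper name and the alphabetical pick of the first candidate are assumptions:

```python
import tempfile
from pathlib import Path

def auto_update_location(metadata: dict, data_folder: Path) -> dict:
    """Point frequencyLocation at a detected data file unless already remote or inline."""
    location = metadata.get("frequencyLocation", "")
    if location == "inline" or location.startswith(("http://", "https://")):
        return metadata  # leave inline/remote submissions untouched
    candidates = sorted(data_folder.glob("*.csv")) + sorted(data_folder.glob("*.parquet"))
    if candidates:
        metadata["frequencyLocation"] = f"file://data/{candidates[0].name}"
    return metadata

# Demo against a throwaway data folder
demo = Path(tempfile.mkdtemp())
(demo / "frequencies.csv").write_text("haplotype,frequency\nA*01:01,0.123\n")
print(auto_update_location({}, demo)["frequencyLocation"])  # → file://data/frequencies.csv
```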
Expected folder structure:
```
input_folder/
├── metadata/
│   └── metadata.json       # Required: HFX metadata + inline data (optional)
└── data/
    └── frequencies.csv     # Optional: if frequencyLocation = "file://data/frequencies.csv"
```
Example:
```bash
mkdir -p example/{metadata,data}
cp metadata.json example/metadata/
cp frequencies.csv example/data/
hfx-build example -n my_submission
# Output: example/my_submission.hfx
```

Options:

- `-n, --name NAME` - Output filename (required, without .hfx)
- `-o, --out DIR` - Output directory (defaults to input folder)
- `--no-manifest` - Skip MANIFEST.json in archive
- `--hash {md5,sha256,none}` - Hash algorithm (default: sha256)
- `--no-auto-update-location` - Don't auto-update `metadata.frequencyLocation` (advanced)
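Since an `.hfx` file is a ZIP underneath (see the packing layer), the default sha256 checksums can be reproduced with the standard library alone. A sketch, with an illustrative streaming helper that is not part of the hfx-tools API:

```python
import hashlib
import io
import zipfile

def sha256_of_member(archive: zipfile.ZipFile, name: str) -> str:
    """Stream one archive member through SHA-256 (the default --hash algorithm)."""
    digest = hashlib.sha256()
    with archive.open(name) as member:
        for chunk in iter(lambda: member.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Build a tiny archive in memory to demonstrate
payload = b'{"frequencyLocation": "inline"}'
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("metadata/metadata.json", payload)

with zipfile.ZipFile(buf) as z:
    print(z.namelist())                               # ['metadata/metadata.json']
    print(sha256_of_member(z, "metadata/metadata.json") ==
          hashlib.sha256(payload).hexdigest())        # True
```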
For submissions with multiple populations, use a flat naming scheme with a POPULATIONS.json manifest. This keeps the structure identical whether you have 1 or many populations:
```
input_folder/
├── metadata/
│   ├── pop1_metadata.json
│   ├── pop2_metadata.json
│   └── pop3_metadata.json
├── data/
│   ├── pop1_frequencies.csv
│   ├── pop2_frequencies.csv
│   └── pop3_frequencies.csv
└── POPULATIONS.json
```
POPULATIONS.json structure:
```json
{
  "populations": [
    {
      "id": "pop1",
      "name": "Population 1",
      "metadataFile": "pop1_metadata.json",
      "frequencyFile": "pop1_frequencies.csv"
    },
    {
      "id": "pop2",
      "name": "Population 2",
      "metadataFile": "pop2_metadata.json",
      "frequencyFile": "pop2_frequencies.csv"
    }
  ]
}
```

Build command:
```bash
hfx-build input_folder -n multi_population_submission
# Generates: multi_population_submission.hfx (with POPULATIONS.json manifest)
```

The build process will:
- Read `POPULATIONS.json` to discover all populations
- Load each population's metadata and frequency data
- Validate each population independently
- Pack all into a single `.hfx` archive with the manifest
Note: The flat naming scheme (`pop*_metadata.json`, `pop*_frequencies.csv`) keeps the file structure identical whether a submission contains one population or many, so even single-population submissions can use this pattern for consistency.
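The discovery step can be approximated with stdlib code. Below is a sketch of cross-checking `POPULATIONS.json` against the `metadata/` and `data/` folders; the helper name is hypothetical, and the actual `build.py` behavior may differ:

```python
import json
import tempfile
from pathlib import Path

def check_populations_manifest(input_folder: Path) -> list:
    """List every metadataFile/frequencyFile named in POPULATIONS.json that is missing."""
    manifest = json.loads((input_folder / "POPULATIONS.json").read_text())
    missing = []
    for pop in manifest["populations"]:
        if not (input_folder / "metadata" / pop["metadataFile"]).is_file():
            missing.append(f"{pop['id']}: metadata/{pop['metadataFile']}")
        if not (input_folder / "data" / pop["frequencyFile"]).is_file():
            missing.append(f"{pop['id']}: data/{pop['frequencyFile']}")
    return missing

# Demo: build a throwaway folder with one complete population
root = Path(tempfile.mkdtemp())
(root / "metadata").mkdir()
(root / "data").mkdir()
(root / "metadata" / "pop1_metadata.json").write_text("{}")
(root / "data" / "pop1_frequencies.csv").write_text("haplotype,frequency\n")
(root / "POPULATIONS.json").write_text(json.dumps({"populations": [
    {"id": "pop1", "name": "Population 1",
     "metadataFile": "pop1_metadata.json",
     "frequencyFile": "pop1_frequencies.csv"}]}))
print(check_populations_manifest(root))  # → []
```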
For direct packing when you already have a metadata.json:
```bash
hfx-pack metadata.json -o dist/example.hfx --manifest --hash sha256
```

Inspect a metadata file or a bundled archive:

```bash
hfx-inspect metadata.json  # Inspect a metadata.json file
hfx-inspect example.hfx    # Inspect a bundled .hfx archive
```

Compute quality control statistics:

```bash
hfx-qc metadata.json --write-metadata --topk 10 100 1000
```

Launch the interactive web interface:
```bash
streamlit run hfx_tools/streamlit_app.py
```

The Streamlit app provides:
- Folder browser - Select local folders with metadata/ and data/ subdirectories
- File upload - Upload metadata.json and data files directly
- Auto-update mode - Automatically sets `metadata.frequencyLocation` to point to uploaded data
- Metadata preview - View JSON structure and what will be auto-updated before building
- Validation preview - Run validators and see results
- HFX download - Download the built .hfx file
- Build logs - View detailed validation and packing logs
The build process includes an extensible validation framework with built-in validators:
- Metadata required fields - Ensures `metadata.frequencyLocation` is present
- Frequency location - Validates frequency location format (inline, file://, http://)
- Frequency data format - Checks inline frequency data structure, types, and duplicates
- File references - Verifies that referenced data files exist
Validation results are logged and returned with error/warning levels. The build fails if any error-level validations fail.
Hackathon participants can register custom validators:
```python
from hfx_tools.validators import ValidationFramework, ValidationResult

def my_custom_validator(metadata_json, hfx_obj, data_folder):
    # Your validation logic here
    return ValidationResult(
        validator_name="my_validator",
        passed=True,
        message="My validation passed",
        level="info"  # or "warning", "error"
    )

validator_framework = ValidationFramework()
validator_framework.register_validator("my_validator", my_custom_validator)
```

Build and submit multiple HFX files from organized folders:
```bash
for dir in submissions/*/; do
    hfx-build "$dir" -n "$(basename "$dir")" -o dist/
done
```

Point to frequencies hosted on Zenodo or S3 without bundling:
```json
{
  "frequencyLocation": "https://zenodo.org/record/12345/files/data.csv",
  ...
}
```

The build skips file detection and includes only the metadata:

```bash
hfx-build my_submission -n my_file --no-auto-update-location
```

For small datasets, embed frequencies directly in JSON:
```json
{
  "frequencyLocation": "inline",
  "frequencyData": [
    {"haplotype": "A*01:01", "frequency": 0.123},
    {"haplotype": "A*01:02", "frequency": 0.456}
  ]
}
```

Combine data from multiple populations into a single HFX file:
```
submission/
├── metadata/
│   ├── european_metadata.json
│   ├── asian_metadata.json
│   └── african_metadata.json
├── data/
│   ├── european_frequencies.csv
│   ├── asian_frequencies.csv
│   └── african_frequencies.csv
└── POPULATIONS.json
```

POPULATIONS.json:
```json
{
  "populations": [
    {
      "id": "european",
      "name": "European Population",
      "metadataFile": "european_metadata.json",
      "frequencyFile": "european_frequencies.csv"
    },
    {
      "id": "asian",
      "name": "Asian Population",
      "metadataFile": "asian_metadata.json",
      "frequencyFile": "asian_frequencies.csv"
    },
    {
      "id": "african",
      "name": "African Population",
      "metadataFile": "african_metadata.json",
      "frequencyFile": "african_frequencies.csv"
    }
  ]
}
```

Build:
```bash
hfx-build submission -n global_study
# Output: global_study.hfx (contains all 3 populations)
```

Use the Python API directly:

```python
from hfx_tools.build import build

result = build(
    input_folder="my_data/",
    output_name="my_submission",
    output_dir="dist/",
    hash_algorithm="sha256",
    include_manifest=True
)

print(f"Build {'succeeded' if result.success else 'failed'}")
for validation in result.validations:
    print(f"  {validation.level}: {validation.message}")
```

Register and run custom validators:
```python
from hfx_tools.validators import ValidationFramework, ValidationResult

# Create framework
validator = ValidationFramework()

# Add custom validation
def check_population_size(metadata_json, hfx_obj, data_folder):
    pop_size = metadata_json.get("populationSize", 0)
    if pop_size < 100:
        return ValidationResult(
            validator_name="population_size",
            passed=False,
            message=f"Population too small: {pop_size} < 100",
            level="warning"
        )
    return ValidationResult(
        validator_name="population_size",
        passed=True,
        message=f"Population size OK: {pop_size}",
        level="info"
    )

validator.register_validator("population_size", check_population_size)

# Run validations
results = validator.validate(metadata, hfx, data_folder)
```

Load, modify, and build from a script:
```python
from hfx_tools.build import build
from hfx_tools.io import read_metadata_json

# Load and modify metadata before building
metadata = read_metadata_json("metadata.json")
metadata["submissionNotes"] = "Added via script"

# Build with custom settings
result = build(
    input_folder=".",
    output_name="my_hfx",
    hash_algorithm="sha256",
    include_manifest=True
)

if not result.success:
    print("Validation errors:")
    for v in result.validations:
        if v.level == "error":
            print(f"  - {v.message}")
```
```
.
├── __init__.py
├── build.py           # Build orchestration (NEW - hackathon MVP)
├── cli.py             # Command-line interface
├── inspect.py         # HFX inspection tools
├── io.py              # JSON and file I/O
├── pack.py            # Low-level packing
├── qc.py              # Quality control
├── streamlit_app.py   # Web UI (NEW - hackathon MVP)
├── util.py            # Utilities
├── validators.py      # Validation framework (NEW - hackathon MVP)
├── Makefile
└── pyproject.toml
```
```bash
# Clone and install with dev dependencies
git clone https://github.com/nmdp-bioinformatics/hfx-tools
cd hfx-tools
make sync EXTRAS="dev,lint"
```

```bash
make fmt    # Format code
make lint   # Check code style
make test   # Run test suite
make build  # Build distribution
```

- validators.py - Add new validators here (see the `ValidationResult` class)
- build.py - Core build logic, add workflow features here
- cli.py - Command-line entry points, add new commands here
- streamlit_app.py - Web UI, add interactive features here
- Fork the repo
- Create a feature branch (`git checkout -b feature/my-feature`)
- Add tests for new functionality
- Run `make lint test` to verify
- Submit a pull request
Cause: Metadata doesn't include the frequencyLocation field.
Solution: Add to metadata.json:
```json
{
  "frequencyLocation": "file://data/frequencies.csv"
}
```

Or use inline frequencies if no external file:

```json
{
  "frequencyLocation": "inline",
  "frequencyData": [...]
}
```

Solution: Check the build log:
```bash
hfx-build my_data -n output
cat my_data/output.build.log  # Detailed validation results
```

Cause: Data file exists but metadata.frequencyLocation points to the wrong path.
Solution: Ensure relative paths match structure:
```
my_data/
├── metadata/
│   └── metadata.json    # with frequencyLocation: "file://data/my_file.csv"
└── data/
    └── my_file.csv      # ← matches the path
```
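The path-matching rule can be checked in isolation. This sketch assumes `file://` locations resolve relative to the input folder root, as in the examples above; the helper is illustrative, not part of the hfx-tools API:

```python
from pathlib import Path

def resolve_file_location(input_folder: Path, frequency_location: str) -> Path:
    """Resolve a file:// frequencyLocation against the input folder root."""
    prefix = "file://"
    if not frequency_location.startswith(prefix):
        raise ValueError(f"not a file:// location: {frequency_location!r}")
    return input_folder / frequency_location[len(prefix):]

resolved = resolve_file_location(Path("my_data"), "file://data/my_file.csv")
print(resolved.as_posix())  # → my_data/data/my_file.csv
print(resolved.is_file())   # whether the referenced file actually exists
```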
Solution: Ensure you have write permission to the target directory:

```bash
mkdir -p ~/.hfx-tools
make sync VENV=~/.hfx-tools/.venv
```

Cause: The file is missing, malformed, or in the wrong location.
Solution: Ensure POPULATIONS.json is at the root of input folder (same level as metadata/ and data/):
```
input_folder/
├── POPULATIONS.json    # ← Must be here
├── metadata/
└── data/
```

Validate JSON syntax:

```bash
python -m json.tool input_folder/POPULATIONS.json
```

Cause: POPULATIONS.json references files that don't exist in metadata/ or data/.
Solution: Check that filenames match exactly:
```json
{
  "populations": [
    {
      "id": "pop1",
      "metadataFile": "pop1_metadata.json",    # ← Must exist in metadata/
      "frequencyFile": "pop1_frequencies.csv"  # ← Must exist in data/
    }
  ]
}
```

- HFX Specification - Authoritative format specification and schema
- phycus - Related NMDP bioinformatics tools
- Issues & Discussions - Report bugs or suggest features
- HFX Spec Issues - Discuss spec-related questions