Config layer: Pydantic settings, default YAML, reference assets #1

Open
cropsgg wants to merge 2 commits into main from pr/01-config-reference-assets

cropsgg (Owner) commented Apr 15, 2026

Summary

Adds a typed configuration layer (Pydantic + YAML) and bundled reference assets so the project can move from hard-coded HBB strings toward a gene-agnostic, reproducible setup.

What changed

  • config/default.yaml: the single source of truth for gene/refseq, reference FASTA path, ClinVar query and parsing-related flags, synthetic variant templates, Bloom/DNABERT/BGPCA/training hyperparameters, cache paths, and Gradio defaults.
  • bloom_dnabert/settings.py: AppSettings models, load_settings(), path resolution from the repo root, an optional BLOOM_CONFIG env override, and a settings fingerprint helper for cache invalidation.
  • bloom_dnabert/reference.py: FASTA loading and optional SHA-256 verification.
  • bloom_dnabert/codon.py: loads the codon table from JSON, or from an override path given in settings.
  • bloom_dnabert/data/: reference transcript FASTA, pathogenic k-mer seeds, codon table JSON.
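
For reviewers who want a mental model of the path handling in settings.py, here is a minimal sketch of how a config path could be chosen and anchored at the repo root. It is not the code in this PR: the function names, the precedence order (explicit argument, then BLOOM_CONFIG, then the default), and the reading of BLOOM_CONFIG as a path to a YAML file are all assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Mirrors the config/default.yaml entry listed above.
DEFAULT_CONFIG = Path("config/default.yaml")


def resolve_from_root(repo_root: Path, path: Path) -> Path:
    """Leave absolute paths alone; anchor relative ones at the repo root."""
    return path if path.is_absolute() else repo_root / path


def resolve_config_path(
    repo_root: Path,
    explicit: Optional[Path] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Hypothetical precedence: explicit argument, then BLOOM_CONFIG, then default."""
    if explicit is not None:
        chosen = Path(explicit)
    elif env.get("BLOOM_CONFIG"):
        # Assumption: the variable holds a path to an alternative YAML file.
        chosen = Path(env["BLOOM_CONFIG"])
    else:
        chosen = DEFAULT_CONFIG
    return resolve_from_root(repo_root, chosen)
```

Passing the environment as a parameter keeps the sketch easy to test without mutating os.environ; the real load_settings() may well read the variable directly.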

Why this PR exists

Downstream work (data loader, training, UI) needs stable contracts for gene symbol, transcript accession, reference sequence, and training knobs. Centralizing them avoids scattering magic numbers and makes swapping targets a config change instead of a refactor.
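
One way to picture why a central config helps downstream caches: if every knob lives in one settings object, a single hash of that object can tell a cache whether it is stale. The sketch below is an illustration under assumptions, not the fingerprint helper shipped in this PR. The function name, the 16-character truncation, and the `training.epochs` key are hypothetical; only `gene.symbol` appears in the PR text.

```python
import hashlib
import json
from typing import Any, Mapping


def settings_fingerprint(settings: Mapping[str, Any], length: int = 16) -> str:
    """Return a short, stable hash of a plain settings mapping."""
    # sort_keys and fixed separators make the JSON text independent of key order;
    # default=str stringifies values such as Path objects that JSON cannot encode.
    canonical = json.dumps(
        settings, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical settings: only gene.symbol is named in this PR.
hbb = {"gene": {"symbol": "HBB"}, "training": {"epochs": 10}}
hbb_reordered = {"training": {"epochs": 10}, "gene": {"symbol": "HBB"}}
other_gene = {"gene": {"symbol": "BRCA1"}, "training": {"epochs": 10}}

same = settings_fingerprint(hbb) == settings_fingerprint(hbb_reordered)
changed = settings_fingerprint(hbb) != settings_fingerprint(other_gene)
```

With a Pydantic v2 model, `model_dump()` would produce a mapping suitable for this kind of hashing; whether the PR's helper works that way is not stated here.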

Reviewer notes

  • Merge this PR first in the stacked series; later PRs build on this branch.
  • requirements.txt adds only pydantic and PyYAML here, so that the settings module can be imported when this branch is checked out alone.

How to verify

python -c "from bloom_dnabert.settings import load_settings; from pathlib import Path; s=load_settings(Path('config/default.yaml')); print(s.gene.symbol, s.reference.fasta_path)"
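
Beyond the one-liner above, the optional SHA-256 check on the reference FASTA can be reasoned about separately. The following is a rough sketch of that idea and not the contents of bloom_dnabert/reference.py: the function name and signature are hypothetical, it assumes a single-record FASTA, and it assumes the digest covers the raw file bytes (the PR may hash the sequence instead).

```python
import hashlib
from pathlib import Path
from typing import Optional


def load_fasta_sequence(path: Path, expected_sha256: Optional[str] = None) -> str:
    """Read a single-record FASTA file, optionally verifying the file's SHA-256."""
    raw = Path(path).read_bytes()
    if expected_sha256 is not None:
        # Assumption: the checksum is computed over the file bytes as stored.
        actual = hashlib.sha256(raw).hexdigest()
        if actual != expected_sha256.lower():
            raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
    lines = raw.decode("ascii").splitlines()
    # Drop header lines (starting with '>') and blank lines, then join the rest.
    body = (line.strip() for line in lines if line.strip() and not line.startswith(">"))
    return "".join(body).upper()
```

Leaving `expected_sha256` as None skips verification, which matches the "optional" wording in the change list; a mismatch raises rather than warns so that a silently altered reference cannot reach training.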

cropsgg (Owner, Author) commented Apr 15, 2026

Stacked PRs (merge order)

This PR is the base of a four-part stack. After it lands on main, either merge #2, #3, and #4 in sequence, or change the base of the open PRs to main and rebase each branch so the diff stays clean.

  • #2: data loader
  • #3: training pipeline
  • #4: app, requirements, README
