Major refactor #7

Open
mparker2 wants to merge 32 commits into master from experiment_refactor

@mparker2

Collaborator

Pull Request Outline: Experiment Refactor

Summary

This is a major architectural refactor that reorganizes the coelsch codebase to better separate concerns through a new experimental design system. The refactor introduces a dedicated coelsch/experiment/ module that centralizes how experimental parameters, genotypes, and genetic crossing structures are defined and managed throughout the pipeline.


What's New (Key Changes)

1. New Experimental Design Module (coelsch/experiment/)

  • Purpose: Centralized handling of experimental metadata and genotype structures
  • New Files:
    • experiment/params.py - ExperimentParams dataclass defining lifecycle stage, crossing strategy, sequencing type, and genotyping strategy
    • experiment/genotypes.py - GenotypeKey and PositionalGenotypes classes for parsing and managing genotype expressions (e.g., (col0*ler), (col0*ler)[f]*cvi0[m])
    • experiment/design.py - ExperimentalDesign orchestrator combining params and genotypes
    • experiment/factories.py - Factory functions to create experimental designs from various inputs
    • experiment/utils.py - Utilities for extracting haplotype info from BAM/VCF files
  • Replaces: Ad-hoc ploidy_type and seq_type string parameters scattered throughout the codebase
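
As a rough illustration of how a single parameters object can replace scattered string arguments, here is a minimal sketch. It is not the actual ExperimentParams from experiment/params.py: the class name ExperimentParamsSketch, the enum names, and the free-form seq_type field are assumptions; only the allowed values are taken from the CLI options listed in this PR.

```python
from dataclasses import dataclass
from enum import Enum


class LifecycleStage(str, Enum):
    GAMETES = "gametes"
    PROGENY = "progeny"


class CrossingStrategy(str, Enum):
    F1 = "f1"
    F2 = "f2"
    BACKCROSS = "backcross"
    TESTCROSS = "testcross"
    THREE_WAY = "three_way"
    FOUR_WAY = "four_way"


# Values taken from the --genotyping-strategy option described in this PR.
GENOTYPING_STRATEGIES = {"founder", "recombinant", "auto"}


@dataclass(frozen=True)
class ExperimentParamsSketch:
    """Hypothetical stand-in for ExperimentParams; field names are assumed."""

    lifecycle_stage: LifecycleStage
    crossing_strategy: CrossingStrategy
    seq_type: str  # allowed values are not listed in the PR, so left free-form
    genotyping_strategy: str = "auto"

    def __post_init__(self):
        # Coerce plain strings to enums so invalid values fail early.
        object.__setattr__(
            self, "lifecycle_stage", LifecycleStage(self.lifecycle_stage)
        )
        object.__setattr__(
            self, "crossing_strategy", CrossingStrategy(self.crossing_strategy)
        )
        if self.genotyping_strategy not in GENOTYPING_STRATEGIES:
            raise ValueError(
                f"unknown genotyping strategy: {self.genotyping_strategy!r}"
            )


params = ExperimentParamsSketch("gametes", "f1", seq_type="example")
print(params.lifecycle_stage.value, params.crossing_strategy.value)
```

Validating at construction time means downstream code can branch on enum members rather than re-checking strings in every module.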

2. Genotype Key Refactoring

  • Old: coelsch/load/genotype.py had a simple GenotypeKey class limited to two-level hierarchies
  • New: coelsch/experiment/genotypes.py has a complete rewrite of GenotypeKey:
    • Supports arbitrary nesting for complex crosses (e.g., backcross, testcross, three/four-way)
    • Can parse genotype strings with optional sex labels: (col0[f]*ler[m]), (col0*ler)*(col0*cvi0)
    • Provides sex role tracking and founder haplotype extraction
    • Used throughout for consistency
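
To make the nested genotype syntax concrete, the following is a minimal recursive-descent parser for expressions such as (col0*ler)[f]*cvi0[m]. It is only a sketch and is not the GenotypeKey implementation: the Node class, parse_genotype(), and founders() are hypothetical names, and the grammar (one * per nesting level, optional [f] or [m] label after a term) is inferred from the examples in this PR.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

_TOKEN = re.compile(r"\s*([()*]|\[[fm]\]|[A-Za-z0-9_]+)")


@dataclass(frozen=True)
class Node:
    name: Optional[str] = None        # set for founder leaves only
    parents: Tuple["Node", ...] = ()  # two entries for a cross
    sex: Optional[str] = None         # "f", "m", or None if unlabeled


def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_term(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of expression")
    tok = tokens[pos]
    if tok == "(":
        node, pos = _parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        pos += 1
    elif tok in ("*", ")") or tok.startswith("["):
        raise ValueError(f"unexpected token {tok!r}")
    else:
        node, pos = Node(name=tok), pos + 1
    # An optional sex label applies to the term it follows.
    if pos < len(tokens) and tokens[pos].startswith("["):
        node = Node(node.name, node.parents, sex=tokens[pos][1])
        pos += 1
    return node, pos


def _parse_expr(tokens, pos):
    left, pos = _parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "*":
        right, pos = _parse_term(tokens, pos + 1)
        left = Node(parents=(left, right))
    return left, pos


def parse_genotype(text):
    tokens = tokenize(text)
    node, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return node


def founders(node):
    """Founder names in left-to-right order (repeats are kept)."""
    if node.name is not None:
        return [node.name]
    return [name for parent in node.parents for name in founders(parent)]


key = parse_genotype("(col0*ler)[f]*cvi0[m]")
print(founders(key), key.parents[0].sex, key.parents[1].sex)
```

Requiring parentheses for anything deeper than a single cross keeps the grammar unambiguous: an input like a*b*c is rejected rather than silently given an arbitrary grouping.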

3. Genotyping Refactored into coelsch/load/genotyping.py

  • Old: coelsch/load/genotype.py (deleted)
  • New: coelsch/load/genotyping.py separates genotyping EM algorithm from genotype structures
  • Now works with ExperimentalDesign instead of custom GenotypesSet class
  • Functions: em_assign, assign_genotype_with_em, parallel_assign_genotypes, genotype_from_inv_counts, etc.

4. Enhanced Data Structure Initialization

  • Records (MarkerRecords, PredictionRecords) now initialized with:
    # Old
    MarkerRecords(chrom_sizes, bin_size, seq_type=..., ploidy_type=...)
    
    # New
    MarkerRecords(chrom_sizes, bin_size, experiment_params=experimental_design.experiment_params)
  • Provides experiment metadata at record creation time

5. Refactored Load Commands

  • run_loadbam() and run_loadcsl() now accept:
    • --lifecycle-stage (gametes/progeny) replaces --ploidy-type
    • --crossing-strategy (f1/f2/backcross/testcross/three_way/four_way)
    • --genotyping-strategy (founder/recombinant/auto)
    • --sample-unit (single_cell/bulk/auto)
  • Creates ExperimentalDesign upfront, passes to BAM/cellSNP loaders
  • Unified genotyping entry point via experimental_design parameter

6. Background Cleaning Refactored

  • coelsch/clean/commands.py now derives parameters from experiment_params instead of legacy ploidy_type
  • New expected_haplotype_ratio() function in coelsch/clean/normalise.py derives expected ratios from crossing strategy
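
The idea of deriving expected ratios from the cross structure can be sketched from first principles: under Mendelian segregation with no distortion, each parent contributes half of an individual's genome, so a founder's expected share is halved at every generation between it and the individual. The function below is not the expected_haplotype_ratio() added in this PR; founder_fractions() and its nested-tuple pedigree encoding are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def founder_fractions(pedigree, weight=Fraction(1)):
    """Expected genome fraction per founder for an individual.

    A pedigree is either a founder name (str) or a 2-tuple of parental
    pedigrees. Assumes Mendelian segregation without distortion.
    """
    fractions = defaultdict(Fraction)
    if isinstance(pedigree, str):
        fractions[pedigree] += weight
        return dict(fractions)
    first_parent, second_parent = pedigree
    for parent in (first_parent, second_parent):
        # Each parent passes on half of its own expected composition.
        for name, share in founder_fractions(parent, weight / 2).items():
            fractions[name] += share
    return dict(fractions)


f1 = ("col0", "ler")
print(founder_fractions((f1, f1)))        # F2 progeny
print(founder_fractions((f1, "col0")))    # backcross progeny
print(founder_fractions((f1, "cvi0")))    # three-way progeny
```

The same function applied to the F1 itself gives the expected composition of its gametes, since a gamete samples half of the parent's genome uniformly under these assumptions.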

7. Mask/Imbalance Detection Updated

  • coelsch/clean/mask.py refactored to use expected haplotype ratios from experiment design
  • Functions like create_single_cell_haplotype_imbalance_mask() now accept expected_ratio parameter instead of ploidy_type
  • Supports multi-haplotype masking (e.g., three-way, four-way crosses)

8. Plot Module Restructured

  • Old: coelsch/plot.py (monolithic, ~1000 lines, deleted)
  • New: coelsch/plot/ package with:
    • core.py - Shared plotting utilities (subplots, etc.)
    • markerplots.py - Single-cell marker coverage plots
    • happlots.py - Dataset-level recombination, allele ratio, distortion plots
    • commands.py - CLI dispatcher
  • Supports dynamic color palettes for multi-haplotype crosses
  • Auto-scaling of y-axis heights

9. Prediction (HMM) Updates

  • coelsch/predict/rhmm/estimate.py completely rewritten:
    • New estimate_emissions() function replacing estimate_haploid_emissions(), estimate_diploid_emissions_*() variants
    • Uses ExperimentParams to determine expected component dosages
    • Supports arbitrary multi-way crosses (not just haploid/diploid)
  • coelsch/predict/rhmm/independent.py added:
    • IndependentMeiosesHMM for three/four-way crosses where two meioses can be modeled independently
  • coelsch/predict/crossovers.py refactored:
    • samples_to_crossover_events() replaces samples_to_crossover_positions()
    • Exports meiosis index alongside bin and sign
  • coelsch/predict/gt_assignment.py added (new):
    • Assigns sampled crossover events to ground-truth events for validation

10. Distortion Analysis Updated

  • coelsch/distortion.py refactored:
    • segregation_distortion_chroms() now accepts expected_probs parameter
    • Uses experiment_params.haplotype_dosage for null hypothesis
    • Helper functions: _chrom_haplotype_probabilities(), _expected_contingency(), _marginal_expected_contingency(), _soft_contingency(), _g_test_lod()
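
For readers unfamiliar with the statistic, a G-test against expected haplotype probabilities can be sketched as below. This is a generic textbook formulation and not the _g_test_lod() helper from this PR; the function name g_test_lod(), its signature, and the conversion LOD = G / (2 ln 10) are stated here as assumptions about what such a helper would compute.

```python
import math


def g_test_lod(observed, expected_probs):
    """G-test of observed counts against expected class probabilities.

    Returns the log10 likelihood ratio (LOD), where G = 2 * ln(LR) and
    therefore LOD = G / (2 * ln 10). Counts may be fractional ("soft").
    """
    if len(observed) != len(expected_probs):
        raise ValueError("observed and expected_probs differ in length")
    if not math.isclose(sum(expected_probs), 1.0, rel_tol=1e-9):
        raise ValueError("expected_probs must sum to 1")
    total = sum(observed)
    if total <= 0:
        return 0.0
    g_stat = 0.0
    for count, prob in zip(observed, expected_probs):
        if count > 0:
            if prob <= 0:
                raise ValueError("nonzero count in a zero-probability class")
            # Classes with zero observations contribute nothing to G.
            g_stat += 2.0 * count * math.log(count / (total * prob))
    return g_stat / (2.0 * math.log(10))


# Same counts, two null hypotheses: a 1:1 ratio versus a 3:1 ratio.
print(g_test_lod([75, 25], [0.5, 0.5]))
print(g_test_lod([75, 25], [0.75, 0.25]))
```

Passing the expected probabilities explicitly is what lets the same test serve crosses whose null ratio is not 1:1, such as a backcross.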

11. Stats Module Updated

  • New functions in coelsch/stats.py:
    • n_crossovers() - Calculate crossover count from haplotype predictions (moved from API)
    • Likely more refactoring here (not shown in diff but referenced)

12. CLI Command Line Options Refactored

  • coelsch/main/opts/common_opts.py:
    • Replaced --ploidy-type (haploid/diploid_bc1/diploid_f2) with --lifecycle-stage and --crossing-strategy
    • New --sample-unit, --genotyping-strategy options
  • coelsch/main/opts/callbacks.py:
    • validate_loadbam_input() and validate_loadcsl_input() updated to handle new experimental design options
    • Auto-detection: --genotyping-strategy=auto infers from presence of --recombinant-parent-jsons
  • coelsch/main/opts/load_opts.py:
    • Simpler --crossing-combinations parsing (delegates to GenotypeKey.from_str())
    • Default --hap-tag-type changed to multi_haplotype
  • New: validate_sim_input() in callbacks
  • New: Sim-specific options (--target-crossing-strategy, --sim-cross-only, --threshold-ground-truth)

13. CLI Utilities

  • coelsch/main/utils.py (new):
    • kolle_alaaf() - Fancy ASCII art "kolle alaaf" citation/fun function

14. Gitignore & Test Data

  • Updated .gitignore to ignore REQUIRED.md and test_data/ directory

Benefits of This Refactor

  1. Unified Genotype Representation: Single GenotypeKey class handles all crossing structures
  2. Extensible Experimental Metadata: ExperimentParams easily extended with new properties (e.g., haplotype_dosage, n_haplotype_states)
  3. Better Multi-Way Support: Complex crossing strategies with >2 haplotypes now supported
  4. Cleaner Data Model: Records initialized with full context upfront
  5. Maintainability: Related concerns (genotypes, crossing strategies, dosages) grouped in experiment/ module

Files Modified/Created Summary

| Category | Action | Files |
|---|---|---|
| New Module | Created | coelsch/experiment/{__init__,params,genotypes,design,factories,utils}.py |
| Genotyping | Moved & Refactored | load/genotype.py → load/genotyping.py (restructured) |
| Cleaning | Deleted | clean/background.py |
| Cleaning | Refactored | clean/{commands,mask,normalise}.py |
| Plotting | Deleted | plot.py (monolithic) |
| Plotting | Created | plot/{__init__,core,markerplots,happlots,commands}.py |
| Prediction | Refactored | predict/rhmm/{estimate,model}.py |
| Prediction | Created | predict/rhmm/independent.py, predict/gt_assignment.py |
| Distortion | Updated | distortion.py |
| API | Updated | api.py (uses new n_crossovers from stats) |
| Records | Updated | records.py (new initialization signature) |
| Load | Updated | load/{commands,loadbam,loadcsl}/*.py |
| CLI | Updated | main/{opts/*,cli,utils}.py |

Migration Notes

Users will need to update CLI calls:

  • --ploidy-type haploid → --lifecycle-stage gametes --crossing-strategy f1
  • --ploidy-type diploid_f2 → --lifecycle-stage progeny --crossing-strategy f2
  • --ploidy-type diploid_bc1 → --lifecycle-stage progeny --crossing-strategy backcross
  • Programmatic API users: pass experiment_params to MarkerRecords/PredictionRecords instead of separate seq_type/ploidy_type
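
For scripts that build command lines programmatically, the flag mapping above could be applied with a small helper like the one below. This helper is hypothetical and not part of coelsch; migrate_args() and the mapping table name are illustrative, and only the three mappings listed in these notes are covered.

```python
# Mapping taken from the migration notes in this PR.
LEGACY_PLOIDY_TO_DESIGN = {
    "haploid": ("gametes", "f1"),
    "diploid_f2": ("progeny", "f2"),
    "diploid_bc1": ("progeny", "backcross"),
}


def _replacement(value):
    try:
        stage, strategy = LEGACY_PLOIDY_TO_DESIGN[value]
    except KeyError:
        raise ValueError(f"unknown --ploidy-type value: {value!r}") from None
    return ["--lifecycle-stage", stage, "--crossing-strategy", strategy]


def migrate_args(argv):
    """Rewrite legacy --ploidy-type arguments; other arguments pass through."""
    migrated = []
    args = iter(argv)
    for arg in args:
        if arg == "--ploidy-type":
            value = next(args, None)
            if value is None:
                raise ValueError("--ploidy-type is missing its value")
            migrated.extend(_replacement(value))
        elif arg.startswith("--ploidy-type="):
            migrated.extend(_replacement(arg.split("=", 1)[1]))
        else:
            migrated.append(arg)
    return migrated


print(migrate_args(["--ploidy-type", "diploid_bc1"]))
```

Unknown legacy values raise an error instead of being passed through, so a stale script fails loudly rather than running with an unintended design.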
