
DTW DynaCLR Monorepo #398

Open
edyoshikun wants to merge 61 commits into modular-viscy-staging from dynadtw

Conversation

@edyoshikun
Member

No description provided.

edyoshikun and others added 30 commits March 31, 2026 13:43
Add normalization columns (norm_mean/std/median/iqr/max/min),
z_focus_mean, and TCZYX shape columns to the cell index schema.
preprocess_cell_index reads per-FOV zattrs and writes stats as
parquet columns for fast per-row normalization at training time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ExperimentRegistry.from_cell_index: build registry directly from
  preprocessed parquet + zarr metadata (no collection YAML needed)
- datamodule: cell_index_path as primary entry point, _train_final_crop
  changed from BatchedRandSpatialCropd to BatchedCenterSpatialCropd
  (random crop for Z/XY translation is now a user-configured augmentation)
- dataset: read norm stats from parquet columns, build_norm_meta fallback
- index: _align_parquet_columns, _resolve_dims from parquet Y/X_shape

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DynaCLR-3D-BagOfChannels-v2: z_window=32, yx_patch=256,
  RandSpatialCrop(40,228,228) after affine for Z focus invariance
  + XY translation, CenterCrop(32,160,160) auto-appended.
  batch_size=256, 2 GPUs, 2-day wall time.
- Add dataloader_demo.py: Jupyter-style visualization of raw vs
  augmented anchor/positive batches with per-sample metadata
- Update demo configs and inspection scripts for new pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
np.nanmin/nanmax fail on scipy sparse arrays. Convert to dense
before computing range stats so the command works on Seurat-exported
anndata zarr stores.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
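The densify-before-stats pattern from this fix can be sketched as follows (toy data, not the repo's code):

```python
import numpy as np
from scipy import sparse

# np.nanmin / np.nanmax fail on scipy sparse matrices, so convert to a
# dense ndarray before computing range stats.
X = sparse.csr_matrix(np.array([[0.0, 2.0], [5.0, np.nan]]))
dense = X.toarray()  # plain ndarray; implicit zeros are materialized
lo, hi = np.nanmin(dense), np.nanmax(dense)  # NaN-aware range stats
```

For large matrices this trades memory for correctness; densifying a full expression matrix may not be viable, in which case chunked conversion would be needed.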
- CLI for running evals
- DAG for evals
- yaml files for evals
… 3 base callbacks

   - model/contrastive_encoder_convnext_tiny.yml: ConvNeXt-Tiny class_paths
   - model/dinov3_frozen_mlp.yml: frozen DINOv3 + MLP projection block
   - augmentations/ops_2d_mild.yml: OPS-specific mild augmentation pipeline
   - data/ops_gene_reporter.yml: OPS data defaults (patch sizes, sampling)
- train_linear_classifier() now returns a third value: raw val outputs
  (y_val, y_val_proba, classes) for downstream ROC curve plotting
- orchestrated run-linear-classifiers generates metrics_summary.pdf
  alongside the CSV: bar chart of AUROC/accuracy/F1 + per-task ROC curves
- Delete evaluate_dataset.py (argparse-based, not in CLI, superseded by
  orchestrator) and its example config
- Strip generate_comparison_report and its helpers from report.py;
  file is now CV-only
- Remove dead _detect_n_features() from cross_validation.py
- Update all callers of train_linear_classifier() to unpack 3-tuple
- Update DAG doc and linear classifiers README

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FOVRecord.channel_markers: dict[str, str] maps zarr channel name to
  marker for a specific well (populated from Airtable channel_N_marker fields)
- ChannelEntry.wells: list[str] restricts a channel to a subset of wells;
  empty means valid in all wells
- build_collection auto-populates wells by comparing which wells have a
  non-None marker for each channel across all FOVRecords
- _build_experiment_tracks skips channel rows where ch.wells is non-empty
  and the current well is not in that set, preventing noise rows from
  mixed-plate experiments (e.g. viral sensor only in B/3, C/2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The glob */*/* on zarr v3 stores yields zarr.json files (e.g. A/2/zarr.json)
in addition to position directories. The previous check only stripped names
starting with "." (.zattrs, .zgroup) but missed zarr.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
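A minimal sketch of the corrected filter (hypothetical helper name; the real code lives in the collection builder):

```python
import tempfile
from pathlib import Path

def position_dirs(plate_root: Path) -> list[Path]:
    # zarr v3 stores place zarr.json at every hierarchy level; zarr v2
    # uses dotfiles (.zattrs, .zgroup). Keep only real position dirs.
    return [
        p for p in plate_root.glob("*/*/*")
        if p.is_dir() and not p.name.startswith(".") and p.name != "zarr.json"
    ]

# demo on a throwaway layout mimicking plate/row/col/position
root = Path(tempfile.mkdtemp())
(root / "A" / "2" / "0").mkdir(parents=True)
(root / "A" / "2" / "zarr.json").write_text("{}")
found = position_dirs(root)
```

The `is_dir()` check alone would also suffice here, but the explicit name filter documents the zarr v3 pitfall for the next reader.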
…ollection

- DynaCLR-2D-MIP-BagOfChannels: add viral_sensor + Phase3D for
  2025_01_28, 2024_10_09, 2024_10_16; fix dragonfly tracks_path
  to point to inner zarr store (tracking.zarr/2024_08_14_...zarr)
- DynaCLR-3D-BagOfChannels-v2: add viral_sensor + Phase3D for
  2025_01_28, 2024_10_09, 2024_10_16
- DynaCLR-3D-BagOfChannels-v3: new collection copied from v2 with
  dragonfly tracks_path fix; v2 left intact for running training job
- DynaCLR-BoC-lc-evaluation-v1: add viral_sensor for all datasets;
  add Phase3D for 2025_01_28

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wire load_config to delegate to load_composed_config so eval configs
  support base: recipe inheritance (same mechanism as training configs)
- Extract shared eval settings into 4 recipes: predict.yml, reduce.yml,
  plot_infectomics.yml, linear_classifiers_infectomics.yml
- Slim down DynaCLR-2D-BagOfChannels-v3, DynaCLR-2D-MIP-BagOfChannels-v1,
  DINOv3-temporal-MLP-2D-BagOfChannels-v1, and test_evaluation configs
  to use base: references — eliminating copy-pasted 14-experiment
  annotation blocks and shared step configs
- Fix ONNX inference to use GPU (CUDAExecutionProvider) and suppress
  pthread_setaffinity_np noise with intra/inter_op_num_threads=1
- Switch CTC tracking SLURM script to gpu partition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix \bbf[\b_] -> \bbf(\b|_): inside a character class, \b is a
  backspace character, not a word boundary
- Add \bphc\b to detect phase-contrast (PhC) as label-free

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
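The character-class pitfall is easy to reproduce; a small sketch of the before/after patterns:

```python
import re

# Inside a character class, \b is the backspace character (\x08), not a
# word boundary, so [\b_] only matched a literal backspace or underscore.
bad = re.compile(r"\bbf[\b_]")    # broken: requires \x08 or "_" after "bf"
good = re.compile(r"\bbf(\b|_)")  # fixed: word boundary or "_" after "bf"
phc = re.compile(r"\bphc\b", re.IGNORECASE)  # phase contrast (PhC) detector

miss = bad.search("bf channel")        # no backspace/underscore follows "bf"
hit = good.search("bf channel")        # boundary branch matches
hit_us = good.search("bf_retardance")  # underscore branch matches
```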
pandas 3+ uses Arrow-backed strings by default, which breaks anndata's
zarr writer. Apply the same fix in two code paths:
- embedding_writer.py: replace select_dtypes("string") with per-column
  isinstance checks for pd.StringDtype and Arrow-backed Categoricals
- zarr_utils.py: convert ArrowStringArray columns and index to object
  dtype before calling append_to_anndata_zarr

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
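The per-column downcast described above can be sketched like this (hypothetical helper name; the real fix is split across embedding_writer.py and zarr_utils.py):

```python
import pandas as pd

def downcast_arrow_strings(df: pd.DataFrame) -> pd.DataFrame:
    # anndata's zarr writer cannot handle StringDtype / Arrow-backed
    # columns, so cast them to plain object dtype before writing.
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.StringDtype):
            out[col] = out[col].astype(object)
    if isinstance(out.index.dtype, pd.StringDtype):
        out.index = out.index.astype(object)
    return out
```

Checking `isinstance(dtype, pd.StringDtype)` covers both python- and pyarrow-backed string storage, which is why it replaces the narrower `select_dtypes("string")` call.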
- PHATE: default n_jobs from -1 (all cores) to 1 to prevent hogging
  shared SLURM nodes; exposed in PHATEConfig and compute_phate()
- Annotation: support (fov_name, t, track_id) join as fallback when
  both sides lack an 'id' column; normalize fov_name by stripping
  leading/trailing slashes to prevent join mismatches

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
For multiclass problems, compute one-vs-rest AUROC per class and report
as val_{class_name}_auroc columns in the results DataFrame.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
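One-vs-rest AUROC per class reduces to binarizing the labels per class and scoring each probability column; a sketch with toy data (class names and column naming are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_val = np.array(["infected", "uninfected", "infected", "mitotic"])
classes = np.array(["infected", "mitotic", "uninfected"])  # proba column order
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.2, 0.2],
    [0.1, 0.8, 0.1],
])
# one-vs-rest: binarize labels against each class, score its proba column
aurocs = {
    f"val_{cls}_auroc": roc_auc_score((y_val == cls).astype(int), y_proba[:, i])
    for i, cls in enumerate(classes)
}
```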
- viscy-utils: add onnx, onnxscript to core deps; copairs to eval extras
- dynaclr: add tracking optional group (gurobipy, onnxruntime-gpu,
  py-ctcmetrics, tabulate, tracksdata) for CTC tracking benchmark
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- index.py: replace O(N*tau) Python loop in _compute_valid_anchors with
  vectorized pd.MultiIndex.isin(); add fit=False predict-mode fast path
  that skips anchor computation; add precomputed_valid_anchors to
  clone_with_subset() to avoid redundant recomputation; accept
  cell_index_df to avoid double-reading parquet
- dataset.py: replace per-row loops in _build_match_lookup with
  groupby().indices; skip lookup build in predict mode; add organelle,
  well, microscope to exported metadata columns
- datamodule.py: tune defaults (num_workers=4, cache_pool=500MB,
  pin_memory=True, buffer_size=4); use vectorized MultiIndex.isin for
  FOV split; reuse pre-loaded cell_index_df from ExperimentRegistry
- experiment.py: from_cell_index returns (registry, dataframe) tuple
  so callers can reuse the DataFrame without re-reading from disk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
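The vectorized anchor-validity check can be sketched on a toy frame (hypothetical column names; the real index has many more):

```python
import pandas as pd

# An anchor (lineage_id, t) is valid when the shifted key
# (lineage_id, t + tau) also exists. Instead of looping over rows,
# build both key sets as MultiIndexes and test membership in bulk.
tracks = pd.DataFrame({
    "lineage_id": ["a", "a", "a", "b", "b"],
    "t":          [0,   1,   2,   0,   5],
})
tau = 1
existing = pd.MultiIndex.from_frame(tracks[["lineage_id", "t"]])
shifted = pd.MultiIndex.from_arrays([tracks["lineage_id"], tracks["t"] + tau])
valid_anchors = tracks[shifted.isin(existing)]  # one vectorized pass
```

`MultiIndex.isin` hashes each shifted tuple once, turning the O(N*tau) Python loop into a single membership test over the index.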
Use .get() with None default for transcriptome_anndata and skip the
barcode join when it is absent, allowing embeddings on datasets that
lack paired scRNA-seq.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Centralize cell_index_path to shared /hpc/projects/.../collections/
  dir across all training configs
- MIP model: extend z_extraction_window 11->20, z_focus_offset 0.5->0.3,
  yx_patch_size 192->256, add BatchedRandSpatialCropd for Z-invariance
- 3D BoC: num_workers 2->4; SLURM time limit 2d->4d
- Collection: mark DynaCLR-2D-BagOfChannels-v3 as [LEGACY]; fix well
  assignments in BoC-lc-evaluation-v1 (add A/1 for 07_24, remove
  incorrect B/1 and B/2 from 01_28)
- Add new collections: annotated MIP subset, test subset, alfi-eval
  (ALFI mitosis, 3 cell lines), microglia-eval (5 perturbations),
  benchmark_2exp (dataloader profiling)
- predict.yml: add TQDMProgressBar callback (refresh_rate=10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- evaluate.py: remove all SLURM script generation (_generate_*_sh,
  _slurm_header, _run_local*); replace with prepare_configs() that
  generates YAML configs and prints a JSON manifest to stdout; rename
  CLI command evaluate -> prepare-eval-configs; add MMD config generators
- evaluate_config.py: remove SlurmConfig; add MMDStepConfig and
  ComparisonSpec imports; split PlotStepConfig.color_by into per-exp
  and combined_color_by; update TaskSpec.marker_filters docstring for
  auto-expand behaviour
- cli.py: add prepare-eval-configs, check-evals, append-annotations,
  append-predictions, split-embeddings, compute-mmd, plot-mmd-heatmap,
  evaluate-tracking-accuracy commands
- split_embeddings.py: new CLI to split combined embeddings.zarr by
  experiment, replacing inline SLURM script logic
- check_evals.py: new CLI to print eval completion status from registry
- eval_registry.yaml: declarative registry of models to evaluate
- Delete 4 stale SLURM-era eval configs (SlurmConfig schema removed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three modes for measuring embedding-space distribution shifts:
- Per-experiment (explicit comparison pairs, faceted by marker)
- Combined (pairwise cross-experiment with batch centering)
- Pooled (concatenates all experiments, BH FDR correction)

Core implementation:
- viscy_utils/evaluation/mmd.py: kernel MMD with median heuristic,
  Gaussian RBF kernel, unbiased estimator, and vectorized permutation
  test (avoids Python loops via binary label matrix multiplication)
- viscy_utils/evaluation/embedding_map.py: mAP via copairs for
  phenotypic profiling (optional dependency)
- evaluation/mmd/config.py: Pydantic config hierarchy for all three
  modes; temporal binning, shared bandwidth, balance_samples
- evaluation/mmd/compute_mmd.py: orchestrates the three analysis modes;
  computes activity_zscore = (mmd2 - null_mean) / null_std for
  cross-marker comparability; outputs per-marker CSV files
- evaluation/mmd/plotting.py: kinetics lines, heatmaps, activity
  z-score heatmaps, combined cross-experiment heatmaps, multi-panel
  grids, paired heatmaps with shared colorbar
- configs/evaluation/recipes/mmd_defaults.yml: shared algorithm defaults
  (1000 permutations, max 2000 cells, seed 42) for YAML inheritance
- tests/test_mmd.py: unit tests for MMD implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
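The core estimator can be sketched in a few lines; this is a minimal illustration of an unbiased MMD² with a Gaussian RBF kernel and median-heuristic bandwidth, not the repo's vectorized-permutation implementation:

```python
import numpy as np

def mmd2_unbiased(x: np.ndarray, y: np.ndarray) -> float:
    # Pairwise squared distances over the pooled sample.
    z = np.concatenate([x, y])
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    # Median heuristic: set the kernel scale from the median pairwise distance.
    sigma2 = np.median(sq[sq > 0]) / 2.0
    k = np.exp(-sq / (2.0 * sigma2))
    n, m = len(x), len(y)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    # Unbiased estimator: drop the diagonal self-similarity terms.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())
```

The activity z-score then standardizes the observed statistic against the permutation null, `(mmd2 - null_mean) / null_std`, so values are comparable across markers with different kernel scales.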
…ver-time

- orchestrated.py: when marker_filters is None, auto-discover all unique
  obs["marker"] values and run one classifier per marker; save trained
  pipelines as {task}_{marker}.joblib with manifest.json; add
  _plot_f1_over_time for per-class F1 at each timepoint; output one
  {task}_summary.pdf per task (was a single merged PDF)
- orchestrated_test.py: update fixtures to expect 2 rows per task with
  auto-expansion; add test for sparse-marker skipping and F1-over-time
  plot generation
- append_annotations.py: new CLI to persist ground-truth annotation
  columns directly into per-experiment zarr obs
- append_predictions.py: new CLI to apply saved classifier pipelines to
  all cells in per-experiment zarrs, writing predicted_{task} to obs and
  predicted_{task}_proba to obsm

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When group_by is set (default "marker"), evaluate_smoothness iterates
over unique group values, computes smoothness per group, saves per-group
CSV, generates per-group plots, then aggregates via mean/std. Output
filenames now include experiment_name for disambiguation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Evaluates whether DynaCLR embeddings improve cell tracking on Cell
Tracking Challenge datasets vs an IoU baseline.

- tracking_accuracy/config.py: Pydantic models for ONNX model entries,
  CTC dataset entries, ILP solver weights, and full benchmark config
- tracking_accuracy/utils.py: seg_dir layout helper, pad_to_shape,
  normalize_crop (z-score using whole-frame statistics)
- tracking_accuracy/evaluate_tracking.py: main benchmark driver
- ctc_tracking_2d_mip_boc.yaml: DynaCLR-2D-MIP vs IoU on DIC-C2DL-HeLa
- ctc_tracking_2d_mip_boc_all.yaml: all CTC sequences variant
- export_onnx_2d_mip_boc.yml: config for exporting the MIP model to ONNX

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pairplot: change diag_kind kde -> hist; rasterize scatter points to
  prevent PDF bloat; improve legend (alpha=1.0, larger marker sizes)
- Scatter 2D: improve legend (markerscale=6, fontsize=10, framealpha=1.0,
  edgecolor="black")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
edyoshikun and others added 7 commits April 14, 2026 14:53
Replace 5 monolithic analysis scripts with a structured 5-stage pipeline
using DTW Barycenter Averaging (DBA) for principled trajectory alignment.

Core library (evaluation/pseudotime/dtw_alignment.py):
- build_infection_template(): DBA with medoid initialization from
  annotated transitioning cells; per-experiment z-score -> PCA ->
  L2-normalize preprocessing; time calibration maps template positions
  to real minutes
- dtw_align_tracks(): per-track DTW to template, produces pseudotime
  in [0,1] and label propagation fractions per template position
- alignment_results_to_dataframe(): assembles results DataFrame

Pipeline stages (scripts/pseudotime/):
- 0-build_templates: build DBA templates from annotated transitions,
  diagnostic lineage overview
- 1-align_cells: DTW-align all cell trajectories to template; alignment
  diagnostic plots (pseudotime vs real time, cost distributions, PCA)
- 2-evaluate_dtw: evaluate alignment against annotations (AUC, onset
  concordance, IoU)
- 3-organelle_dynamics: per-organelle embedding dynamics along infection
  pseudotime, remodeling heatmaps and montage grids
- 4-export_anndata: merge DTW results back into AnnData zarr copies
- cell_count_funnel.py: summarize cell/track filtering across all stages

Configs and tests:
- multi_template.yaml: switch to MIP embeddings dir, update embedding
  patterns for viral_sensor, G3BP1, SEC61 channels
- test_pseudotime.py: add TestTimeCalibration (monotonicity, round-trip)
  and TestMetricsContinuous (onset/peak detection)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- profile_stages.py: extend z_window 16->32; add I/O bandwidth
  reporting (MB/s, MB read per anchor+positive)
- benchmark_setup_time.py: benchmark _compute_valid_anchors and
  _build_match_lookup on 3.3M-row parquet to validate vectorization
- profile_num_workers.py: sweep num_workers to find optimal parallelism
- profile_predict_batch_size.py: sweep predict batch sizes
- test_2d_mip_augmentation.py: visual verification of 2D MIP
  augmentation pipeline (z-crop + MIP)
- explore_gut_parquet.py: exploratory script for gut dataset parquet

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- compare_evals.py: cross-model evaluation comparison that reads
  eval_registry.yaml outputs and generates comparison plots for
  smoothness, AUROC, and MMD activity z-scores across models
- microglia_alfi_analysis.py: PCA/UMAP embedding analysis for microglia
  (by perturbation) and ALFI HeLa (by cell cycle phase)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Config-driven pipeline for NFS-to-VAST dataset preparation:
- prepare.py: orchestrates concatenation, QC, and preprocessing steps
  driven by Airtable metadata
- prepare_cli.py: CLI entry point for the prepare pipeline
- configs/prepare_config.yml: example config for dataset preparation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- configs/cellanome/: per-run embed_dinov3.yml and embed_dynaclr.yml
  configs for 5 Cellanome flow cell runs (A549 infectomics panels,
  mixed GFP+RFP, SEC61B/G3BP1/pAL40 DENV rerun); embed_all.sh helper
- docs/DAGs/ai_ready_datasets.md: DAG for AI-ready dataset preparation
  pipeline
- docs/DAGs/pseudotime.md: DAG for DTW pseudotime pipeline stages
- docs/DAGs/training.md: DAG for model training workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
anndata 0.12.9+ pulls pandas <3, so we pin 0.12.6 with pandas 3 and
manually downcast Arrow-backed strings. Remove once anndata 0.13
supports pandas 3 natively.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…es on registration

- Add microscope, labelfree_modality, treatment, hours_post_treatment to
  FOVRecord and DatasetRecord; parse from Airtable singleSelect responses
- Add all four fields to WELL_TEMPLATE_FIELDS so they propagate to per-FOV records
- Raise ValueError when a well template has no cell_line set (required for
  channel marker derivation — previously silently skipped)
- Auto-delete well template records after registration batch: register_fovs
  populates template_ids_to_delete; CLI calls batch_delete after create/update
- Add batch_delete to AirtableDatasets
- Wire microscope into build_collection so it flows to ExperimentEntry and
  cell_index parquet (was previously always empty string)
- Update all tests; fix pre-existing test regressions from is_dir() filter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
edyoshikun and others added 22 commits April 15, 2026 16:38
compute_timing_metrics.py reduces each cell's cosine-distance-from-pre-baseline
curve to SNR-robust scalars (t_onset_abs, t50, t_peak, delta_peak,
rise_rate_per_hour) and pools into per-organelle distributions.

compute_label_timing.py does the same from LC predicted_{state} labels
(t_first_pos, t_run_start, pos_fraction, flips). Supervised projection
gives sharper cross-organelle separation (e.g. SEC61 pos_fraction=0.81 vs
G3BP1=0.00, p=1.6e-4) than unsupervised cosine distance.

Both ship a compute sub-command for per-organelle per-cell parquet plus
summary markdown, and a compare sub-command that merges parquets and
emits strip plots plus pairwise rank-sum tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds directory-layout entries for compute_timing_metrics.py (embedding
cosine-distance timing) and compute_label_timing.py (LC-prediction
timing), plus dedicated sections documenting per-cell scalars, outputs,
the aligned-only vs whole-track asymmetry, and example numbers for
SEC61 vs G3BP1 on sensor_all_07_24.

Notes the next planned iteration: configurable multi-dataset pool with
ZIKV/DENV virus-stratified comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When multiple per-cell parquets from `compute` share an organelle_channel
but differ in query_set (e.g. ZIKV pool vs DENV pool, both on sensor),
the old compare step collapsed them into one group. Now:

- Auto-detect: split by organelle_channel if >1 present, else query_set.
- --group-by CLI flag to override the default.
- Markdown + plot headers reflect the grouping column.

Unblocks cross-virus comparison via paired single-virus query sets in
align_cells.yaml (sensor_zikv_pool, sensor_denv_pool).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five zarrs were generated by predict but skipped by the LC step because
they weren't listed in the annotations block:

- 2025_01_28_A549_viral_sensor_ZIKV_DENV
- 2025_01_28_A549_Phase3D_ZIKV_DENV
- 2024_11_07_A549_SEC61_DENV_viral_sensor
- 2025_01_24_A549_G3BP1_DENV_viral_sensor
- 2025_08_26_A549_viral_sensor_ZIKV

All five reuse their dataset's existing combined annotations CSV. The
effect for downstream Stage 3d label-timing: the ZIKV pool
(07_22 + 07_24 + 08_26 + 01_28 ZIKV) gains predicted_infection_state on
every sensor zarr, and DENV gets full coverage across 2024_11_07,
2025_01_24, and 2025_01_28 DENV well.

Re-run: `nextflow run main.nf --eval_config ... -resume` will skip
cached predict/split/reduce and only rerun LC + append_predictions + plot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_get_position` and `_get_tensorstore` were keyed by `fov_name` alone,
so the same FOV path (e.g. `A/3/0`, `0/3/000000`) shared across
experiments in a MultiExperimentDataModule returned the first-cached
experiment's zarr for every subsequent lookup. This caused samples
from later experiments to read pixels from the wrong store while
metadata still reported the correct experiment — silently corrupting
training batches. Key the caches by `(store_path, fov_name)` instead.

Verified by Pearson-correlating dataloader output against direct zarr
reads at the same coordinates: all 8 SEC61B anchors from 3 experiments
sharing `A/1/0`/`A/2/0`/`A/3/0` now match 1.0 (previously 2/8 matched,
6/8 had ~0 correlation).

Also explains previously-observed edge artifacts in patches despite
clamping: the cached zarr was from a different experiment with
different FOV dimensions, so clamp margins no longer matched the
actual image bounds.

Affects OPS and every DynaCLR training run with multiple experiments
sharing FOV names (DynaCLR-2D-MIP-BagOfChannels: 157 collisions,
DynaCLR-3D-BagOfChannels-v2: 112 collisions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
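The keying fix itself is simple; a stand-in sketch (the store-open call is a placeholder, not the real zarr API usage):

```python
# Keying the cache by fov_name alone returns the first-cached
# experiment's store for every experiment sharing that FOV path.
# Include store_path in the key so each experiment gets its own entry.
_cache: dict[tuple[str, str], str] = {}

def get_position(store_path: str, fov_name: str) -> str:
    key = (store_path, fov_name)  # fov_name alone would collide across experiments
    if key not in _cache:
        _cache[key] = f"opened:{store_path}/{fov_name}"  # stand-in for a zarr open
    return _cache[key]
```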
The per-row pandas .iloc / .iterrows pattern in positive-pair lookup
was the dominant per-batch bottleneck: 4500 ms/batch at batch=512 on
the 81.5M-row OPS index. Each anchor triggered multiple pd.Series
constructions (~9 ms each) to look up match-key columns, resolve
lineage timepoints, and filter candidates by marker. At 50% GPU
utilization in the lite run, this bottleneck gated the whole pipeline.

Replace with a precomputed NumPy column cache:

  - `_build_anchor_cache()` extracts every valid_anchors column and
    the hot tracks columns (marker, channel_name, experiment, t,
    lineage_id) as `np.ndarray` at dataset __init__.
  - `_sample_positives_temporal()` vectorizes the lineage + tau
    lookup using NumPy fancy-index filtering.
  - `_sample_positives()` for column-match (SupCon) mode takes
    positional anchor indices from the sampler and does NumPy-direct
    key construction, with a single batched tracks.iloc gather at
    the end (one call instead of 512).
  - `_match_lookup` now stores np.ndarray values (zero-copy random
    choice) instead of Python lists.
  - `_extract_meta` uses NumPy label arrays instead of .iterrows().
  - SimCLR (`positive_cell_source="self"`) now clones the anchor
    tensor directly instead of running a second zarr read + meta
    extraction — halves per-batch wall time for SimCLR baselines.
  - `__getitems__` bag-of-channels path reads channel_name from the
    NumPy cache.
  - Predict branch replaces .iterrows() with NumPy column arrays.

Delete the now-unused per-row paths (`_find_positive`,
`_find_temporal_positive`, `_find_column_match_positive`) entirely —
keeping them as fallbacks would be a performance footgun for future
contributors.

Measured per-batch wall time (batch=64, demo subsample):
  - SupCon OPS:   ~80 ms (was 4500 ms at batch=512)
  - SimCLR self:  ~30 ms
  - Temporal:    ~200 ms (2D-MIP)

Correctness verified end-to-end:
  - Pearson correlation anchor vs direct zarr read = 1.0
  - SupCon positives share (gene_name, marker) 64/64
  - Temporal positives share lineage 64/64, all non-zero Δt
  - 22/22 existing dataset unit tests pass after test refactor to
    call the vectorized entry points

Affects every DynaCLR training configuration: OPS (SupCon),
DynaCLR-2D-MIP, DynaCLR-3D-BagOfChannels (temporal), and any SimCLR
baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
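The `groupby().indices` pattern that replaces the per-row lookup can be sketched on a toy frame (column names illustrative):

```python
import numpy as np
import pandas as pd

# Precompute positional row indices per match key once at __init__;
# positive sampling then becomes a dict lookup plus one random choice.
tracks = pd.DataFrame({
    "gene_name": ["g1", "g1", "g2", "g2", "g2"],
    "marker":    ["m1", "m1", "m1", "m2", "m2"],
})
match_lookup = {
    key: np.asarray(idx)  # np.ndarray values: zero-copy random choice
    for key, idx in tracks.groupby(["gene_name", "marker"]).indices.items()
}
rng = np.random.default_rng(0)
anchor_key = ("g2", "m2")
positive_idx = rng.choice(match_lookup[anchor_key])
```

`GroupBy.indices` returns positional indices per group in one pass, so no per-anchor `pd.Series` construction or `.iloc` filtering survives in the hot path.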
Two independent fixes for FlexibleBatchSampler on 16M+ row valid_anchors:

1. `__iter__` materialized the full epoch upfront — blocking DDP for
   several minutes before batch 0. Now yields batches lazily while
   preserving RNG draws across all ranks so DDP stays bit-identical.
2. `_precompute_groups` called pandas groupby on Arrow-backed columns,
   which routes every group slice through pyarrow.compute.take and took
   tens of minutes. Categorical fast path uses `cat.codes` +
   `np.flatnonzero`, and per-group-per-stratum uses `np.intersect1d`
   between prebuilt group/strat arrays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
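The lazy-iteration idea can be sketched like this (a simplified stand-in, not `FlexibleBatchSampler` itself):

```python
from collections.abc import Iterator

import numpy as np

class LazyBatchSampler:
    """Yield batches lazily instead of materializing the full epoch.

    The single permutation draw is seeded identically on every DDP rank,
    so ranks stay bit-identical without an epoch-sized upfront build.
    """

    def __init__(self, n: int, batch_size: int, seed: int = 0):
        self.n, self.batch_size, self.seed = n, batch_size, seed
        self.epoch = 0

    def __iter__(self) -> Iterator[np.ndarray]:
        rng = np.random.default_rng(self.seed + self.epoch)
        order = rng.permutation(self.n)  # one draw, identical on all ranks
        for start in range(0, self.n, self.batch_size):
            yield order[start:start + self.batch_size]
```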
MultiExperimentTripletDataset caching fixes for 81M-row indices:

- `_build_anchor_cache` cached every column of valid_anchors/tracks,
  blowing per-rank RSS. Whitelist the 13 columns actually read in the
  hot path (store_path, fov_name, experiment, t, y_clamp, x_clamp,
  norm_*, channel_name, marker, lineage_id) plus user-supplied
  positive_match_columns and label columns.
- Cast high-cardinality string columns to Categorical before caching
  so indexing hits 4-8 byte codes instead of 40-80 byte object refs.
- Wrap cat-array lookups with `str()` in `_sample_positive_indices_temporal`
  and in `_build_match_lookup` because `_materialize_strings` upstream
  leaves these columns as Categorical — hashing a Categorical scalar
  would not match the str keys in `_lineage_timepoints`.
- Precompute per-experiment `tau_range_frames` to drop a registry call
  per anchor in the temporal sampling hot path.
- Refactor `_slice_patch` / `_slice_patches` / `_sample_positives` to
  take (arrays, indices) instead of DataFrame rows, eliminating
  `iterrows()` and per-row Series construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deferred Categorical cast for `fov_name` and `well_name` in
`_align_parquet_columns` — upstream `cell_index.py` already casts the
low-cardinality text columns on load, but `fov_name` is rewritten here
by the position-prefix logic (Categorical columns would reject the
string concatenation), so the cast has to happen after the rewrite.

Makes the downstream train/val boolean-mask slice a fast int-code
gather instead of pyarrow.compute.take over the string buffer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three setup-time fixes for MultiExperimentDataModule._setup_fov_split
on ~80M-row indices:

- `_materialize_strings`: cast ArrowStringArray columns to Categorical
  before slicing. `df[bool_mask]` on Arrow-backed string columns
  routes through pyarrow.compute.take and scales catastrophically
  (7-8 min per call on 16M rows × 15 string cols). Categorical codes
  + categories make slicing pure NumPy fancy indexing on int codes.
- Replace `pd.MultiIndex.from_arrays / from_tuples` (hashes a Python
  tuple per row) with a per-experiment groupby walk that writes a
  row-aligned boolean mask, eliminating the 80M-tuple index build.
- Guard `val_index` / `val_dataset` construction on `val_tracks.empty`
  instead of `val_keys`, which gets dropped in the new mask-based flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`read_cell_index` now casts low-cardinality string columns
(experiment, marker, store_path, microscope, organelle, reporter,
channel_name) to pandas Categorical. ArrowStringArray-backed columns
route every boolean mask slice through pyarrow.compute.take, which
allocates a fresh buffer per string column and spiked peak RSS by
50+ GiB during train/val FOV partitioning on 80M-row indices.

High-cardinality columns (cell_id, tracks_path, lineage_id) stay
ArrowStringArray so we don't allocate millions of Python string
objects up front — the dataset reads them via the NumPy column cache.

`fov_name` is intentionally left as-is because `_align_parquet_columns`
rewrites it via string concatenation, which Categorical doesn't
support; it gets cast after the rewrite in the runtime index layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subclassing Lightning's SaveConfigCallback to call
`wandb_logger.experiment` inside the setup hook deadlocked DDP on
≥2 ranks: non-zero ranks blocked at the wandb init barrier while
rank 0 was inside the hook, so the setup fence never cleared. Bug
was hidden under `fast_dev_run=True` because Lightning swaps the
real logger for DummyLogger, which doesn't touch wandb internals.

The resulting config saved to `trainer.log_dir` is already picked up
by the wandb files tab automatically when `save_dir` matches, so the
custom callback was net-negative — delete rather than patch.

Removes:
- `packages/viscy-utils/.../save_config_wandb.py`
- `SaveConfigToWandb` export in callbacks/`__init__.py`
- Entry in shared `trainer.yml` recipe
- Entry in OPS-1000genes-lite.yml

See `feedback_wandb_ddp_deadlock.md` for the full postmortem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OPS-style single-marker batch composition variants
(`batch_group_by: marker`, one reporter per batch) to complement
the default mixed-markers runs (`stratify_by=[perturbation, marker]`).

Run pairs for direct A/B comparison:
- DynaCLR-2D-MIP-BagOfChannels: mixed vs single-marker
- DynaCLR-3D-BagOfChannels-v2: mixed vs single-marker

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keeps the diagnostic configs accessible for reproducing DDP hangs,
memory profiling, and fast_dev_run sanity checks without cluttering
the production training directory. Production entry points stay in
`configs/training/`; `debug/` holds the single-node/single-GPU
variants that were used to isolate the SaveConfigToWandb DDP deadlock
and the ArrowStringArray memory spike.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In flat-parquet / bag-of-channels mode (one row per cell × channel),
`_pick_temporal_candidate` restricts positive candidates to rows with
the same marker as the anchor. But `_compute_valid_anchors` only
checked (lineage_id, t+tau) existence, so an anchor with
(lid, marker=Phase3D, t=50) could pass validation when
(lid, marker=GFP, t=51) exists — and then crash at sample time with
"No positive found" because no same-marker row exists in the window.

Fix: include `marker` in the match key when it's present as a
column in `tracks`. Validity now requires the shifted
(lineage_id, marker, t+tau) tuple to exist, matching what the
sampler actually enforces.

Detected in SLURM job 31265738 (2D-MIP single-marker): 268
"No positive found" errors across 66 epochs of training, with the
validation dataloader failing to complete even once — which is why
`loss/val` never appeared in wandb despite train loss logging.

Non-flat-parquet configs (one row per cell) are unaffected since
marker is constant per (lineage, t) there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
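A toy frame reproduces the mismatch (column names illustrative):

```python
import pandas as pd

# In bag-of-channels mode the sampler restricts positives to the anchor's
# marker, so validity must check (lineage_id, marker, t + tau), not just
# (lineage_id, t + tau).
tracks = pd.DataFrame({
    "lineage_id": ["lid", "lid"],
    "marker":     ["Phase3D", "GFP"],
    "t":          [50, 51],
})
tau = 1
existing = pd.MultiIndex.from_frame(tracks[["lineage_id", "marker", "t"]])
keys_no_marker = pd.MultiIndex.from_arrays([tracks["lineage_id"], tracks["t"]])

loose = pd.MultiIndex.from_arrays([tracks["lineage_id"], tracks["t"] + tau])
loose_valid = loose.isin(keys_no_marker)  # Phase3D anchor wrongly passes via the GFP row
strict = pd.MultiIndex.from_arrays(
    [tracks["lineage_id"], tracks["marker"], tracks["t"] + tau]
)
strict_valid = strict.isin(existing)  # no same-marker row at t + tau: all invalid
```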
`_reconstruct_lineage` grouped tracks by `(experiment, fov)`, which
fuses cells from different wells that share an FOV number (e.g.
B/2/002001 and C/2/002001 both have `fov="002001"`). The per-group
track_id → global_track_id map then routes parent_track_id lookups
across wells, producing `lineage_id` strings that alias across wells.

Downstream this crashes the temporal positive sampler with
"No positive found" because `_lineage_timepoints[(exp, lid)]` holds
rows from multiple wells mashed together. About 15-30% of lineages
in the 2D-MIP-BagOfChannels dataset were affected (29 of 30 experiments
had cross-well collisions).

Fix: group by `(experiment, well, fov)` when the `well` column is
available. `global_track_id` already embeds well/fov, so root-walks
inside each group only see track_ids from one biological FOV.

Existing parquets built with the old code carry the aliased lineage
IDs and need to be regenerated; a later commit can flag that at load
time once the rebuild lands.

Also adds:
- `_compute_valid_anchors`: includes `marker` in the validity key
  when present, matching the same-marker filter `_pick_temporal_candidate`
  enforces in flat-parquet / bag-of-channels mode.
- Unit tests: `TestReconstructLineage` in `test_cell_index.py` and
  `test_valid_anchors_marker.py` for the index fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
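The grouping change can be sketched as follows, assuming a pandas frame with `experiment`, `fov`, and an optional `well` column (a simplification of the repo's `_reconstruct_lineage`, which additionally walks parent_track_id roots inside each group):

```python
import pandas as pd

def lineage_groups(tracks: pd.DataFrame) -> pd.core.groupby.DataFrameGroupBy:
    """Group rows for per-FOV lineage reconstruction.

    Include `well` in the group key when the column exists, so FOV
    numbers shared across wells (e.g. B/2/002001 and C/2/002001 both
    have fov="002001") no longer fuse into one group.
    """
    key = ["experiment", "fov"]
    if "well" in tracks.columns:
        key = ["experiment", "well", "fov"]
    return tracks.groupby(key, sort=False)
```

Root-walks done inside each returned group then only ever see track_ids from one biological FOV.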
Rebuilds the two timelapse parquets with the fixed `_reconstruct_lineage`
that scopes by (experiment, well, fov) instead of (experiment, fov).

- `collections/DynaCLR-2D-MIP-BagOfChannels-v2.yml`: copy of the
  unversioned collection YAML.
- `collections/DynaCLR-3D-BagOfChannels-v4.yml`: copy of v2 with
  the dragonfly `tracks_path` corrected to point at the nested
  `2024_08_14_ZIKV_pal17_48h.zarr` (zarr v2 tracking store; the
  outer `tracking.zarr` is just a container).
- Training configs updated to the new parquet paths.

Verified collision-free (0 cross-well lineage aliasing) on both:
- 2D-MIP v2: 3.36M rows across 32 experiments
- 3D-BoC v4: 766k rows across 26 experiments

Also drops the `SaveConfigToWandb` callback entry that was still
referenced in these two training configs (missed in 40ed2f7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
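The collision-free check can be expressed as a one-liner over the rebuilt parquet; this is a hedged sketch assuming `experiment`, `well`, and `lineage_id` columns, not the repo's actual verification script:

```python
import pandas as pd

def cross_well_collisions(df: pd.DataFrame) -> int:
    """Count lineage IDs that span more than one well within an experiment.

    A correctly scoped reconstruction should return 0: every lineage_id
    lives entirely inside a single (experiment, well) pair.
    """
    wells_per_lineage = df.groupby(["experiment", "lineage_id"])["well"].nunique()
    return int((wells_per_lineage > 1).sum())
```

Running a check like this on both rebuilt parquets is what "verified collision-free" means above.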
Before: `val_dataloader` was a plain torch DataLoader with `shuffle=False`,
ignoring `batch_group_by` and `stratify_by`. That served val in parquet
order — one FOV/marker at a time — so the first N val batches all
shared the same marker (visually confirmed in dataloader_demo), and in
DDP the `loss/val` all-reduce silently desynced because each rank's
shard saw a different subset of markers.

After: val uses the same `FlexibleBatchSampler` as train with identical
`batch_group_by` / `stratify_by` / `group_weights` / `seed` settings.
For the BoC configs this means:
- mixed-markers (`batch_group_by=None`, `stratify_by=[perturbation,marker]`)
  produces diverse val batches that mirror train batches.
- single-marker (`batch_group_by=marker`) produces per-marker val batches
  that cycle through all markers across the val epoch instead of
  stalling on one.

Temporal enrichment is disabled for val (no biology-of-interest
oversampling skewing loss/val).

Also:
- `dataloader_demo.py`: add a "Validation dataloader" section that
  iterates val batches, flags NaN/Inf before and after normalization,
  and plots with the same `plot_batch` helper. Confirms val now serves
  diverse markers matching the train composition.
- `OnlineEvalCallback.effective_rank`: guard against NaN/Inf in features
  so a degenerate validation epoch can't crash the whole run with
  "SVD did not converge" from `np.linalg.svd`. Drops affected rows and
  returns NaN when no finite rows remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
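A minimal sketch of the guarded effective-rank computation, assuming the common definition exp(entropy of normalized singular values); the exact formula in `OnlineEvalCallback` may differ:

```python
import numpy as np

def safe_effective_rank(features: np.ndarray) -> float:
    """Effective rank of a (samples, dims) feature matrix with a NaN/Inf guard.

    Drops rows containing non-finite values before the SVD so a
    degenerate validation epoch can't raise "SVD did not converge";
    returns NaN when no finite rows remain.
    """
    finite_rows = np.isfinite(features).all(axis=1)
    feats = features[finite_rows]
    if feats.shape[0] == 0:
        return float("nan")
    s = np.linalg.svd(feats, compute_uv=False)
    if s.sum() == 0:
        return float("nan")
    p = s / s.sum()
    p = p[p > 0]
    # exp of the Shannon entropy of the singular value distribution.
    return float(np.exp(-(p * np.log(p)).sum()))
```

For an identity-like feature matrix this returns the full dimensionality, and rows of NaN/Inf are simply ignored instead of poisoning the SVD.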
Split the flat applications/dynaclr/configs/training/ directory into
per-family subfolders so related runs stay grouped and the root
directory is skimmable:

- DynaCLR-2D/  — 2D (and MIP) time-lapse contrastive runs
- DynaCLR-3D/  — 3D time-lapse contrastive runs
- DINOv3/      — DINOv3 frozen-encoder + MLP probes
- Phase-contrastive/ — Phase-contrastive-timeaware

Each .yml and its paired .sh stay together in the same folder. OPS/ is
organized separately (not included in this commit).

Mechanical updates:
- `base:` paths in leaf YAMLs rewritten from `recipes/...` to
  `../recipes/...` so composition still resolves relative to the YAML.
- `CONFIGS=` in each sbatch script now points at the new subfolder.
- `sbatch ...` comment headers in YAML and SH files updated.
- debug/ sbatch comment headers also updated for references to the
  renamed launch scripts.

Also:
- Deleted stale `slurm-287*.out` logs and the stray `wandb/` directory
  that had accumulated in the configs directory.
- Rewrote README.md to document the new layout, composition rules
  via `base:`, SLURM entry points, and resume semantics.

Verified composition still works via
`viscy_utils.compose.load_composed_config` on a representative yml
from each subfolder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>