fix: address review comments on aggregate_sample_features refactor#116
fix: address review comments on aggregate_sample_features refactor#116
Conversation
- validate per-slide shapes/lengths in aggregate_sample_features, raising informative ValueError with slide index and sample_id - tighten return/variable type annotations to dict[str, tuple[np.ndarray, np.ndarray]] - fix CLI memory regression: group by sample_id first, read H5 files per-sample so peak memory scales with the largest sample, not the full input set - test_multi_slide: assert concatenated content, not just shape - test_with_subsampling: assert reproducibility (same seed → same result) and features/coords alignment after subsampling - add test_aggregate_sample_features_invalid_shapes for the new per-slide validation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
b867488 to
5d6470d
Compare
There was a problem hiding this comment.
Pull request overview
Addresses prior review feedback on the aggregate_sample_features refactor by adding per-slide input validation, tightening typing, fixing a CLI memory regression, and strengthening tests around correctness and reproducibility.
Changes:
- Add per-slide validation in
aggregate_sample_featureswith informativeValueErrors for invalid feature/coord shapes and mismatched lengths. - Update the CLI to process inputs grouped by
sample_idand load H5s one sample at a time to reduce peak memory usage. - Expand tests to assert concatenated content (not just shapes), reproducibility with a fixed seed, feature/coord alignment after subsampling, and new invalid-shape coverage.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
mussel/utils/feature_extract.py |
Adds per-slide input validation and tightens return type annotations for aggregated outputs. |
mussel/cli/aggregate_sample_features.py |
Refactors CLI workflow to group slides by sample and load/process/write one sample at a time to reduce peak memory. |
tests/mussel/cli/test_aggregate_sample_features.py |
Strengthens assertions for concatenation/subsampling and adds coverage for new validation error paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.info( | ||
| "Subsampling to %d tiles per sample (strategy=%s, seed=%d)", | ||
| cfg.max_tiles, | ||
| cfg.subsampling_strategy, | ||
| cfg.seed, | ||
| ) |
There was a problem hiding this comment.
OrderedDict isn’t needed here (project requires Python >=3.10, where plain dict preserves insertion order). Consider replacing groups: dict = collections.OrderedDict() with a typed plain dict (e.g., dict[str, list[int]] = {} or defaultdict(list)) and drop the collections import to reduce overhead and tighten typing (similar to mussel/utils/feature_extract.py:1989).
Python >=3.10 guarantees insertion-order preservation in plain dict, so OrderedDict adds no value here. Also drops the unused collections import from the CLI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses all five review comments left on #115.
Changes
mussel/utils/feature_extract.pyValueErrorwith slide index andsample_idiffeats.ndim != 2,coordsshape is not(N, 2), orlen(feats) != len(coords)dicttodict[str, tuple[np.ndarray, np.ndarray]]mussel/cli/aggregate_sample_features.pysample_idfirst, then read H5 files one sample at a time; peak memory now scales with the largest sample rather than the full input settests/mussel/cli/test_aggregate_sample_features.pytest_aggregate_sample_features_multi_slide— assert concatenated content (not just shapes) vianp.testing.assert_array_equaltest_aggregate_sample_features_with_subsampling— assert reproducibility (same seed → identical outputs) and features/coords alignment after subsampling (using identifiable rows where feature value == coord value)test_aggregate_sample_features_invalid_shapes— new test covering the three new validation paths (1-D features, wrong coords shape, mismatched lengths)