Skip to content

fix: address review comments on aggregate_sample_features refactor#116

Merged
raylim merged 2 commits intomainfrom
fix/aggregate-sample-features-followup
Apr 3, 2026
Merged

fix: address review comments on aggregate_sample_features refactor#116
raylim merged 2 commits intomainfrom
fix/aggregate-sample-features-followup

Conversation

@raylim
Copy link
Copy Markdown
Contributor

@raylim raylim commented Apr 3, 2026

Addresses all five review comments left on #115.

Changes

mussel/utils/feature_extract.py

  • Input validation — validate each slide's arrays before processing: raises a ValueError with slide index and sample_id if feats.ndim != 2, coords shape is not (N, 2), or len(feats) != len(coords)
  • Tighter type annotations — return type and internal variables changed from bare dict to dict[str, tuple[np.ndarray, np.ndarray]]

mussel/cli/aggregate_sample_features.py

  • Fix memory regression — group slide indices by sample_id first, then read H5 files one sample at a time; peak memory now scales with the largest sample rather than the full input set

tests/mussel/cli/test_aggregate_sample_features.py

  • test_aggregate_sample_features_multi_slide — assert concatenated content (not just shapes) via np.testing.assert_array_equal
  • test_aggregate_sample_features_with_subsampling — assert reproducibility (same seed → identical outputs) and features/coords alignment after subsampling (using identifiable rows where feature value == coord value)
  • test_aggregate_sample_features_invalid_shapes — new test covering the three new validation paths (1-D features, wrong coords shape, mismatched lengths)

Copilot AI review requested due to automatic review settings April 3, 2026 17:11
- validate per-slide shapes/lengths in aggregate_sample_features,
  raising informative ValueError with slide index and sample_id
- tighten return/variable type annotations to
  dict[str, tuple[np.ndarray, np.ndarray]]
- fix CLI memory regression: group by sample_id first, read H5 files
  per-sample so peak memory scales with the largest sample, not the
  full input set
- test_multi_slide: assert concatenated content, not just shape
- test_with_subsampling: assert reproducibility (same seed → same
  result) and features/coords alignment after subsampling
- add test_aggregate_sample_features_invalid_shapes for the new
  per-slide validation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@raylim raylim force-pushed the fix/aggregate-sample-features-followup branch from b867488 to 5d6470d Compare April 3, 2026 17:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses prior review feedback on the aggregate_sample_features refactor by adding per-slide input validation, tightening typing, fixing a CLI memory regression, and strengthening tests around correctness and reproducibility.

Changes:

  • Add per-slide validation in aggregate_sample_features with informative ValueErrors for invalid feature/coord shapes and mismatched lengths.
  • Update the CLI to process inputs grouped by sample_id and load H5s one sample at a time to reduce peak memory usage.
  • Expand tests to assert concatenated content (not just shapes), reproducibility with a fixed seed, feature/coord alignment after subsampling, and new invalid-shape coverage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
mussel/utils/feature_extract.py Adds per-slide input validation and tightens return type annotations for aggregated outputs.
mussel/cli/aggregate_sample_features.py Refactors CLI workflow to group slides by sample and load/process/write one sample at a time to reduce peak memory.
tests/mussel/cli/test_aggregate_sample_features.py Strengthens assertions for concatenation/subsampling and adds coverage for new validation error paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 114 to 119
logger.info(
"Subsampling to %d tiles per sample (strategy=%s, seed=%d)",
cfg.max_tiles,
cfg.subsampling_strategy,
cfg.seed,
)
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrderedDict isn’t needed here (project requires Python >=3.10, where plain dict preserves insertion order). Consider replacing groups: dict = collections.OrderedDict() with a typed plain dict (e.g., dict[str, list[int]] = {} or defaultdict(list)) and drop the collections import to reduce overhead and tighten typing (similar to mussel/utils/feature_extract.py:1989).

Copilot uses AI. Check for mistakes.
Python >=3.10 guarantees insertion-order preservation in plain dict,
so OrderedDict adds no value here. Also drops the unused collections
import from the CLI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@raylim raylim merged commit 2d21a83 into main Apr 3, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants