fix: address review comments on aggregate_sample_features refactor by raylim · Pull Request #116 · pathology-data-mining/Mussel

raylim · 2026-04-03T17:11:00Z

Addresses all five review comments left on #115.

Changes

`mussel/utils/feature_extract.py`

Input validation — validate each slide's arrays before processing: raises a ValueError with slide index and sample_id if feats.ndim != 2, coords shape is not (N, 2), or len(feats) != len(coords)
Tighter type annotations — return type and internal variables changed from bare dict to dict[str, tuple[np.ndarray, np.ndarray]]

`mussel/cli/aggregate_sample_features.py`

Fix memory regression — group slide indices by sample_id first, then read H5 files one sample at a time; peak memory now scales with the largest sample rather than the full input set

`tests/mussel/cli/test_aggregate_sample_features.py`

test_aggregate_sample_features_multi_slide — assert concatenated content (not just shapes) via np.testing.assert_array_equal
test_aggregate_sample_features_with_subsampling — assert reproducibility (same seed → identical outputs) and features/coords alignment after subsampling (using identifiable rows where feature value == coord value)
test_aggregate_sample_features_invalid_shapes — new test covering the three new validation paths (1-D features, wrong coords shape, mismatched lengths)

- validate per-slide shapes/lengths in aggregate_sample_features, raising informative ValueError with slide index and sample_id - tighten return/variable type annotations to dict[str, tuple[np.ndarray, np.ndarray]] - fix CLI memory regression: group by sample_id first, read H5 files per-sample so peak memory scales with the largest sample, not the full input set - test_multi_slide: assert concatenated content, not just shape - test_with_subsampling: assert reproducibility (same seed → same result) and features/coords alignment after subsampling - add test_aggregate_sample_features_invalid_shapes for the new per-slide validation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Addresses prior review feedback on the aggregate_sample_features refactor by adding per-slide input validation, tightening typing, fixing a CLI memory regression, and strengthening tests around correctness and reproducibility.

Changes:

Add per-slide validation in aggregate_sample_features with informative ValueErrors for invalid feature/coord shapes and mismatched lengths.
Update the CLI to process inputs grouped by sample_id and load H5s one sample at a time to reduce peak memory usage.
Expand tests to assert concatenated content (not just shapes), reproducibility with a fixed seed, feature/coord alignment after subsampling, and new invalid-shape coverage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`mussel/utils/feature_extract.py`	Adds per-slide input validation and tightens return type annotations for aggregated outputs.
`mussel/cli/aggregate_sample_features.py`	Refactors CLI workflow to group slides by sample and load/process/write one sample at a time to reduce peak memory.
`tests/mussel/cli/test_aggregate_sample_features.py`	Strengthens assertions for concatenation/subsampling and adds coverage for new validation error paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T17:13:39Z

mussel/cli/aggregate_sample_features.py

        logger.info(
            "Subsampling to %d tiles per sample (strategy=%s, seed=%d)",
            cfg.max_tiles,
            cfg.subsampling_strategy,
            cfg.seed,
        )


OrderedDict isn’t needed here (project requires Python >=3.10, where plain dict preserves insertion order). Consider replacing groups: dict = collections.OrderedDict() with a typed plain dict (e.g., dict[str, list[int]] = {} or defaultdict(list)) and drop the collections import to reduce overhead and tighten typing (similar to mussel/utils/feature_extract.py:1989).

Python >=3.10 guarantees insertion-order preservation in plain dict, so OrderedDict adds no value here. Also drops the unused collections import from the CLI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 3, 2026 17:11

Copilot started reviewing on behalf of raylim April 3, 2026 17:11 View session

raylim force-pushed the fix/aggregate-sample-features-followup branch from b867488 to 5d6470d Compare April 3, 2026 17:13

Copilot AI reviewed Apr 3, 2026

View reviewed changes

refactor: replace collections.OrderedDict with plain dict

997bdf9

Python >=3.10 guarantees insertion-order preservation in plain dict, so OrderedDict adds no value here. Also drops the unused collections import from the CLI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

raylim merged commit 2d21a83 into main Apr 3, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: address review comments on aggregate_sample_features refactor#116

fix: address review comments on aggregate_sample_features refactor#116
raylim merged 2 commits intomainfrom
fix/aggregate-sample-features-followup

raylim commented Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

raylim commented Apr 3, 2026

Changes

mussel/utils/feature_extract.py

mussel/cli/aggregate_sample_features.py

tests/mussel/cli/test_aggregate_sample_features.py

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`mussel/utils/feature_extract.py`

`mussel/cli/aggregate_sample_features.py`

`tests/mussel/cli/test_aggregate_sample_features.py`