Conversation
- utility function now takes `features_list`/`coords_list` (`List[np.ndarray]`) instead of `patch_features_h5_paths` (`List[str]`) and returns an ordered dict mapping `sample_id -> (features, coords)` instead of writing files
- file I/O (reading H5 inputs, writing output H5s) moved to the CLI wrapper, consistent with the pattern used by other utility functions
- tests for the utility function now use in-memory numpy arrays directly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Refactors aggregate_sample_features to be a pure in-memory utility (accepting per-slide feature/coord arrays and returning per-sample aggregated arrays), moving all HDF5 read/write responsibilities into the CLI wrapper to match other utility patterns in the codebase.
Changes:
- Updated `aggregate_sample_features` to accept `features_list`/`coords_list` and return per-sample aggregated results (no file I/O).
- Updated the CLI command to read per-slide H5 inputs into arrays, call the utility, and write per-sample H5 outputs.
- Updated unit tests to exercise the utility via in-memory arrays while keeping end-to-end CLI H5 tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `mussel/utils/feature_extract.py` | Refactors aggregation to operate on arrays and return a results mapping instead of writing H5 files. |
| `mussel/cli/aggregate_sample_features.py` | Shifts H5 read/write into the CLI wrapper around the refactored utility. |
| `tests/mussel/cli/test_aggregate_sample_features.py` | Updates utility tests to avoid temp H5 files; retains CLI end-to-end tests. |
```diff
 for i in indices:
-    h5_path = patch_features_h5_paths[i]
-    with h5py.File(h5_path, "r") as h5:
-        feats = np.array(h5["features"])
-        coords = h5["coords"][:]
+    feats = features_list[i]
+    coords = coords_list[i]
     all_features.append(feats)
     all_coords.append(coords)
     slide_sizes.append(len(feats))
-    logger.debug("  slide %d: %d tiles from %s", i, len(feats), h5_path)
+    logger.debug("  slide %d: %d tiles", i, len(feats))
```
aggregate_sample_features now assumes each features_list[i] and coords_list[i] pair is consistent, but it never validates shapes/lengths. If a slide has len(coords) != len(feats) (or unexpected dimensionality), np.concatenate / subsample_tiles can later fail with a less-informative error or silently produce inconsistent outputs. Consider validating per-slide that feats.ndim == 2, coords.ndim == 2, coords.shape[1] == 2, and len(coords) == len(feats) and raising a ValueError that includes the offending slide index/sample_id.
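The suggested validation might look like the following minimal sketch (`validate_slide_arrays` and its signature are illustrative, not the repo's API):

```python
import numpy as np

def validate_slide_arrays(feats: np.ndarray, coords: np.ndarray,
                          slide_idx: int, sample_id: str) -> None:
    """Raise a descriptive ValueError if a slide's features/coords are inconsistent."""
    if feats.ndim != 2:
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): features must be 2-D, got ndim={feats.ndim}")
    if coords.ndim != 2 or coords.shape[1] != 2:
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): coords must have shape (n, 2), got {coords.shape}")
    if len(coords) != len(feats):
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): {len(feats)} feature rows vs {len(coords)} coords")
```

Calling this once per slide before concatenation turns a cryptic `np.concatenate` failure into an error that names the offending slide.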
```diff
 def aggregate_sample_features(
-    patch_features_h5_paths: List[str],
+    features_list: List[np.ndarray],
+    coords_list: List[np.ndarray],
     sample_ids: List[str],
-    output_dir: str,
-    output_h5_suffix: str = "features.h5",
     max_tiles: Optional[int] = None,
     subsampling_strategy: str = "random",
     seed: int = 42,
-) -> None:
-    """Concatenate per-slide patch features into one H5 per sample.
+) -> dict:
+    """Concatenate per-slide patch features into one array per sample.
```
The return type annotation ) -> dict: (and groups: dict / results: dict) is very generic and doesn’t match the documented API (“Ordered dict mapping each unique sample_id to a (features, coords) tuple”). Please tighten the typing to something like collections.OrderedDict[str, tuple[np.ndarray, np.ndarray]] (or dict[str, tuple[np.ndarray, np.ndarray]] if you don’t want to expose OrderedDict) to make the new in-memory API easier/safer to consume.
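A sketch of what the tightened annotation could look like, with the body reduced to plain per-sample concatenation (the subsampling parameters are omitted and the body is illustrative, not the repo's implementation):

```python
from collections import OrderedDict
from typing import Dict, List, Tuple

import numpy as np

def aggregate_sample_features(
    features_list: List[np.ndarray],
    coords_list: List[np.ndarray],
    sample_ids: List[str],
) -> "OrderedDict[str, Tuple[np.ndarray, np.ndarray]]":
    """Group per-slide arrays by sample_id and concatenate, preserving first-seen order."""
    groups: Dict[str, List[int]] = {}
    for i, sid in enumerate(sample_ids):
        groups.setdefault(sid, []).append(i)
    results: "OrderedDict[str, Tuple[np.ndarray, np.ndarray]]" = OrderedDict()
    for sid, indices in groups.items():
        feats = np.concatenate([features_list[i] for i in indices])
        coords = np.concatenate([coords_list[i] for i in indices])
        results[sid] = (feats, coords)
    return results
```

The concrete generic type lets callers unpack `(features, coords)` tuples without consulting the docstring, and type checkers will flag misuse.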
```diff
+    results = _aggregate_sample_features(
+        features_list=[feats_a, feats_b],
+        coords_list=[coords_a, coords_b],
         sample_ids=["sampleX", "sampleX"],
-        output_dir=str(tmp_path / "out"),
         max_tiles=None,
     )

-    out_h5 = tmp_path / "out" / "sampleX.features.h5"
-    assert out_h5.exists()
-    with h5py.File(out_h5) as f:
-        assert f["features"].shape == (35, 4)
-        assert f["coords"].shape == (35, 2)
+    assert results["sampleX"][0].shape == (35, 4)
+    assert results["sampleX"][1].shape == (35, 2)
```
test_aggregate_sample_features_multi_slide only asserts output shapes. Since this PR changes aggregate_sample_features behavior significantly, it would be more robust to also assert that the returned arrays equal the expected concatenation (e.g., np.concatenate([feats_a, feats_b]) / np.concatenate([coords_a, coords_b])) so ordering/content regressions are caught.
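To make that concrete, here is a sketch of the expected-value computation, using stand-in arrays (the 20 + 15 split is hypothetical; any split summing to the test's 35 tiles works the same way):

```python
import numpy as np

# Hypothetical per-slide arrays mirroring the test's 35-tile total.
rng = np.random.default_rng(0)
feats_a, feats_b = rng.random((20, 4)), rng.random((15, 4))
coords_a, coords_b = rng.integers(0, 1000, (20, 2)), rng.integers(0, 1000, (15, 2))

# With max_tiles=None, aggregation should be exactly slide-order concatenation,
# so the test can assert content rather than only shape:
expected_feats = np.concatenate([feats_a, feats_b])
expected_coords = np.concatenate([coords_a, coords_b])

# In the test, after calling _aggregate_sample_features(...):
#   assert np.array_equal(results["sampleX"][0], expected_feats)
#   assert np.array_equal(results["sampleX"][1], expected_coords)
```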
```diff
 def test_aggregate_sample_features_with_subsampling(tmp_path):
     """Subsampling reduces output to max_tiles."""
-    h5_a = tmp_path / "s0.h5"
-    h5_b = tmp_path / "s1.h5"
-    _write_fake_h5(h5_a, n_tiles=80, seed=0)
-    _write_fake_h5(h5_b, n_tiles=60, seed=1)
-
-    _aggregate_sample_features(
-        patch_features_h5_paths=[str(h5_a), str(h5_b)],
+    rng = np.random.default_rng(0)
+    feats_a = rng.random((80, 4)).astype(np.float32)
+    coords_a = rng.integers(0, 1000, (80, 2))
+    feats_b = rng.random((60, 4)).astype(np.float32)
+    coords_b = rng.integers(0, 1000, (60, 2))
+
+    results = _aggregate_sample_features(
+        features_list=[feats_a, feats_b],
+        coords_list=[coords_a, coords_b],
         sample_ids=["big", "big"],
-        output_dir=str(tmp_path / "out"),
         max_tiles=50,
         subsampling_strategy="random",
         seed=99,
     )

-    out_h5 = tmp_path / "out" / "big.features.h5"
-    with h5py.File(out_h5) as f:
-        assert f["features"].shape[0] == 50
+    assert results["big"][0].shape[0] == 50
```
test_aggregate_sample_features_with_subsampling currently only checks that the output tile count is reduced to max_tiles. Consider also validating that subsampling is reproducible given seed and that features and coords remain aligned (e.g., by using easily-identifiable rows/coords and asserting they correspond) to catch potential indexing bugs in the aggregation path.
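A sketch of both checks, using a stand-in random subsampler in place of the real aggregation path; row `i` of the features and of the coords both encode `i`, so alignment survives (and is detectable through) any index-based subsampling:

```python
import numpy as np

n, max_tiles, seed = 140, 50, 99

# Row i of feats is [i, i, i, i]; row i of coords is [i, i].
feats = np.repeat(np.arange(n, dtype=np.float32)[:, None], 4, axis=1)
coords = np.repeat(np.arange(n)[:, None], 2, axis=1)

# Stand-in for the random-subsampling path: one index set applied to both arrays.
idx = np.random.default_rng(seed).choice(n, size=max_tiles, replace=False)
sub_feats, sub_coords = feats[idx], coords[idx]

# Reproducibility: the same seed must select the same rows.
idx2 = np.random.default_rng(seed).choice(n, size=max_tiles, replace=False)
assert np.array_equal(idx, idx2)

# Alignment: each kept feature row must still match its coord row.
assert np.array_equal(sub_feats[:, 0], sub_coords[:, 0].astype(np.float32))
```

In the real test, the same identifiable-row trick would catch an indexing bug where features and coords are subsampled with different index sets.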
```diff
+    features_list = []
+    coords_list = []
+    for h5_path in patch_features_h5_paths:
+        with h5py.File(h5_path, "r") as h5:
+            features_list.append(np.array(h5["features"]))
+            coords_list.append(h5["coords"][:])
+
+    results = aggregate_sample_features(
+        features_list=features_list,
+        coords_list=coords_list,
         sample_ids=sample_ids,
```
The CLI now loads all slide feature arrays into memory upfront (features_list / coords_list) before grouping/aggregating. Previously, aggregation read H5s within the per-sample loop, so peak memory scaled with the largest sample rather than the entire input set. For large cohorts this can be a substantial memory regression and risk OOM; consider grouping indices by sample_id first, then reading/aggregating one sample at a time (or otherwise streaming) so only one sample’s slides are resident at once.
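A sketch of that per-sample streaming shape, with illustrative `load_fn`/`write_fn` callbacks standing in for the CLI's H5 reads/writes (none of these helper names are the repo's API):

```python
from collections import OrderedDict

def aggregate_streaming(slide_paths, sample_ids, load_fn, aggregate_fn, write_fn):
    """Group slide indices by sample_id, then load and aggregate one sample at a
    time, so peak memory scales with the largest sample rather than the cohort."""
    groups = OrderedDict()
    for i, sid in enumerate(sample_ids):
        groups.setdefault(sid, []).append(i)

    for sid, indices in groups.items():
        # Only this sample's slides are resident at once; they are freed
        # (dropped from scope) before the next sample is loaded.
        features_list, coords_list = zip(*(load_fn(slide_paths[i]) for i in indices))
        result = aggregate_fn(list(features_list), list(coords_list), [sid] * len(indices))
        write_fn(sid, *result[sid])
```

Here `load_fn` would wrap the `h5py.File` read and `write_fn` the per-sample H5 write, keeping the refactored utility pure while restoring the old memory profile.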
Summary
Refactors `aggregate_sample_features` in `mussel/utils/feature_extract.py` to follow the same pattern as other utility functions (e.g. `get_features`, `subsample_tiles`): accept in-memory objects, not file paths.

Changes

- `mussel/utils/feature_extract.py`
  - Before: took `patch_features_h5_paths: List[str]`, `output_dir: str`, `output_h5_suffix: str`; read H5 files internally; wrote output H5 files; returned `None`
  - After: takes `features_list: List[np.ndarray]`, `coords_list: List[np.ndarray]`; no file I/O; returns `dict[str, tuple[np.ndarray, np.ndarray]]` mapping `sample_id → (features, coords)`
- `mussel/cli/aggregate_sample_features.py`: reads per-slide H5 inputs into arrays, calls the utility, and writes per-sample H5 outputs
- `tests/mussel/cli/test_aggregate_sample_features.py`: utility tests use in-memory `np.ndarray` objects directly (no temp H5 files needed)