Conversation
- utility function now takes `features_list`/`coords_list` (`List[np.ndarray]`) instead of `patch_features_h5_paths` (`List[str]`) and returns an ordered dict mapping `sample_id -> (features, coords)` instead of writing files
- file I/O (reading H5 inputs, writing output H5s) moved to the CLI wrapper, consistent with the pattern used by other utility functions
- tests for the utility function now use in-memory numpy arrays directly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Refactors aggregate_sample_features to be a pure in-memory utility (accepting per-slide feature/coord arrays and returning per-sample aggregated arrays), moving all HDF5 read/write responsibilities into the CLI wrapper to match other utility patterns in the codebase.
Changes:
- Updated `aggregate_sample_features` to accept `features_list`/`coords_list` and return per-sample aggregated results (no file I/O).
- Updated the CLI command to read per-slide H5 inputs into arrays, call the utility, and write per-sample H5 outputs.
- Updated unit tests to exercise the utility via in-memory arrays while keeping end-to-end CLI H5 tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `mussel/utils/feature_extract.py` | Refactors aggregation to operate on arrays and return a results mapping instead of writing H5 files. |
| `mussel/cli/aggregate_sample_features.py` | Shifts H5 read/write into the CLI wrapper around the refactored utility. |
| `tests/mussel/cli/test_aggregate_sample_features.py` | Updates utility tests to avoid temp H5 files; retains CLI end-to-end tests. |
```diff
 for i in indices:
-    h5_path = patch_features_h5_paths[i]
-    with h5py.File(h5_path, "r") as h5:
-        feats = np.array(h5["features"])
-        coords = h5["coords"][:]
+    feats = features_list[i]
+    coords = coords_list[i]
     all_features.append(feats)
     all_coords.append(coords)
     slide_sizes.append(len(feats))
-    logger.debug("  slide %d: %d tiles from %s", i, len(feats), h5_path)
+    logger.debug("  slide %d: %d tiles", i, len(feats))
```
aggregate_sample_features now assumes each features_list[i] and coords_list[i] pair is consistent, but it never validates shapes/lengths. If a slide has len(coords) != len(feats) (or unexpected dimensionality), np.concatenate / subsample_tiles can later fail with a less-informative error or silently produce inconsistent outputs. Consider validating per-slide that feats.ndim == 2, coords.ndim == 2, coords.shape[1] == 2, and len(coords) == len(feats) and raising a ValueError that includes the offending slide index/sample_id.
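The suggested validation might look like the following minimal sketch (`validate_slide_arrays` and its signature are illustrative, not the repo's API):

```python
import numpy as np

def validate_slide_arrays(feats: np.ndarray, coords: np.ndarray,
                          slide_idx: int, sample_id: str) -> None:
    """Raise a descriptive ValueError if a slide's features/coords are inconsistent."""
    if feats.ndim != 2:
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): features must be 2-D, got ndim={feats.ndim}")
    if coords.ndim != 2 or coords.shape[1] != 2:
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): coords must have shape (n, 2), got {coords.shape}")
    if len(coords) != len(feats):
        raise ValueError(
            f"slide {slide_idx} ({sample_id}): {len(feats)} feature rows vs {len(coords)} coords")
```

Calling this once per slide before concatenation turns a cryptic `np.concatenate` failure into an error that names the offending slide.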
```diff
 def aggregate_sample_features(
-    patch_features_h5_paths: List[str],
+    features_list: List[np.ndarray],
+    coords_list: List[np.ndarray],
     sample_ids: List[str],
-    output_dir: str,
-    output_h5_suffix: str = "features.h5",
     max_tiles: Optional[int] = None,
     subsampling_strategy: str = "random",
     seed: int = 42,
-) -> None:
-    """Concatenate per-slide patch features into one H5 per sample.
+) -> dict:
+    """Concatenate per-slide patch features into one array per sample.
```
The return type annotation ) -> dict: (and groups: dict / results: dict) is very generic and doesn’t match the documented API (“Ordered dict mapping each unique sample_id to a (features, coords) tuple”). Please tighten the typing to something like collections.OrderedDict[str, tuple[np.ndarray, np.ndarray]] (or dict[str, tuple[np.ndarray, np.ndarray]] if you don’t want to expose OrderedDict) to make the new in-memory API easier/safer to consume.
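A sketch of what the tightened annotation could look like, with the body reduced to plain per-sample concatenation (the subsampling parameters are omitted and the body is illustrative, not the repo's implementation):

```python
from collections import OrderedDict
from typing import Dict, List, Tuple

import numpy as np

def aggregate_sample_features(
    features_list: List[np.ndarray],
    coords_list: List[np.ndarray],
    sample_ids: List[str],
) -> "OrderedDict[str, Tuple[np.ndarray, np.ndarray]]":
    """Group per-slide arrays by sample_id and concatenate, preserving first-seen order."""
    groups: Dict[str, List[int]] = {}
    for i, sid in enumerate(sample_ids):
        groups.setdefault(sid, []).append(i)
    results: "OrderedDict[str, Tuple[np.ndarray, np.ndarray]]" = OrderedDict()
    for sid, indices in groups.items():
        feats = np.concatenate([features_list[i] for i in indices])
        coords = np.concatenate([coords_list[i] for i in indices])
        results[sid] = (feats, coords)
    return results
```

The concrete generic type lets callers unpack `(features, coords)` tuples without consulting the docstring, and type checkers will flag misuse.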
```diff
+    results = _aggregate_sample_features(
+        features_list=[feats_a, feats_b],
+        coords_list=[coords_a, coords_b],
         sample_ids=["sampleX", "sampleX"],
-        output_dir=str(tmp_path / "out"),
         max_tiles=None,
     )

-    out_h5 = tmp_path / "out" / "sampleX.features.h5"
-    assert out_h5.exists()
-    with h5py.File(out_h5) as f:
-        assert f["features"].shape == (35, 4)
-        assert f["coords"].shape == (35, 2)
+    assert results["sampleX"][0].shape == (35, 4)
+    assert results["sampleX"][1].shape == (35, 2)
```
test_aggregate_sample_features_multi_slide only asserts output shapes. Since this PR changes aggregate_sample_features behavior significantly, it would be more robust to also assert that the returned arrays equal the expected concatenation (e.g., np.concatenate([feats_a, feats_b]) / np.concatenate([coords_a, coords_b])) so ordering/content regressions are caught.
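To make that concrete, here is a sketch of the expected-value computation, using stand-in arrays (the 20 + 15 split is hypothetical; any split summing to the test's 35 tiles works the same way):

```python
import numpy as np

# Hypothetical per-slide arrays mirroring the test's 35-tile total.
rng = np.random.default_rng(0)
feats_a, feats_b = rng.random((20, 4)), rng.random((15, 4))
coords_a, coords_b = rng.integers(0, 1000, (20, 2)), rng.integers(0, 1000, (15, 2))

# With max_tiles=None, aggregation should be exactly slide-order concatenation,
# so the test can assert content rather than only shape:
expected_feats = np.concatenate([feats_a, feats_b])
expected_coords = np.concatenate([coords_a, coords_b])

# In the test, after calling _aggregate_sample_features(...):
#   assert np.array_equal(results["sampleX"][0], expected_feats)
#   assert np.array_equal(results["sampleX"][1], expected_coords)
```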
```diff
 def test_aggregate_sample_features_with_subsampling(tmp_path):
     """Subsampling reduces output to max_tiles."""
-    h5_a = tmp_path / "s0.h5"
-    h5_b = tmp_path / "s1.h5"
-    _write_fake_h5(h5_a, n_tiles=80, seed=0)
-    _write_fake_h5(h5_b, n_tiles=60, seed=1)
-
-    _aggregate_sample_features(
-        patch_features_h5_paths=[str(h5_a), str(h5_b)],
+    rng = np.random.default_rng(0)
+    feats_a = rng.random((80, 4)).astype(np.float32)
+    coords_a = rng.integers(0, 1000, (80, 2))
+    feats_b = rng.random((60, 4)).astype(np.float32)
+    coords_b = rng.integers(0, 1000, (60, 2))
+
+    results = _aggregate_sample_features(
+        features_list=[feats_a, feats_b],
+        coords_list=[coords_a, coords_b],
         sample_ids=["big", "big"],
-        output_dir=str(tmp_path / "out"),
         max_tiles=50,
         subsampling_strategy="random",
         seed=99,
     )

-    out_h5 = tmp_path / "out" / "big.features.h5"
-    with h5py.File(out_h5) as f:
-        assert f["features"].shape[0] == 50
+    assert results["big"][0].shape[0] == 50
```
test_aggregate_sample_features_with_subsampling currently only checks that the output tile count is reduced to max_tiles. Consider also validating that subsampling is reproducible given seed and that features and coords remain aligned (e.g., by using easily-identifiable rows/coords and asserting they correspond) to catch potential indexing bugs in the aggregation path.
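A sketch of both checks, using a stand-in random subsampler in place of the real aggregation path; row `i` of the features and of the coords both encode `i`, so alignment survives (and is detectable through) any index-based subsampling:

```python
import numpy as np

n, max_tiles, seed = 140, 50, 99

# Row i of feats is [i, i, i, i]; row i of coords is [i, i].
feats = np.repeat(np.arange(n, dtype=np.float32)[:, None], 4, axis=1)
coords = np.repeat(np.arange(n)[:, None], 2, axis=1)

# Stand-in for the random-subsampling path: one index set applied to both arrays.
idx = np.random.default_rng(seed).choice(n, size=max_tiles, replace=False)
sub_feats, sub_coords = feats[idx], coords[idx]

# Reproducibility: the same seed must select the same rows.
idx2 = np.random.default_rng(seed).choice(n, size=max_tiles, replace=False)
assert np.array_equal(idx, idx2)

# Alignment: each kept feature row must still match its coord row.
assert np.array_equal(sub_feats[:, 0], sub_coords[:, 0].astype(np.float32))
```

In the real test, the same identifiable-row trick would catch an indexing bug where features and coords are subsampled with different index sets.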
```diff
+    features_list = []
+    coords_list = []
+    for h5_path in patch_features_h5_paths:
+        with h5py.File(h5_path, "r") as h5:
+            features_list.append(np.array(h5["features"]))
+            coords_list.append(h5["coords"][:])
+
+    results = aggregate_sample_features(
+        features_list=features_list,
+        coords_list=coords_list,
         sample_ids=sample_ids,
```
The CLI now loads all slide feature arrays into memory upfront (features_list / coords_list) before grouping/aggregating. Previously, aggregation read H5s within the per-sample loop, so peak memory scaled with the largest sample rather than the entire input set. For large cohorts this can be a substantial memory regression and risk OOM; consider grouping indices by sample_id first, then reading/aggregating one sample at a time (or otherwise streaming) so only one sample’s slides are resident at once.
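A sketch of that per-sample streaming shape, with illustrative `load_fn`/`write_fn` callbacks standing in for the CLI's H5 reads/writes (none of these helper names are the repo's API):

```python
from collections import OrderedDict

def aggregate_streaming(slide_paths, sample_ids, load_fn, aggregate_fn, write_fn):
    """Group slide indices by sample_id, then load and aggregate one sample at a
    time, so peak memory scales with the largest sample rather than the cohort."""
    groups = OrderedDict()
    for i, sid in enumerate(sample_ids):
        groups.setdefault(sid, []).append(i)

    for sid, indices in groups.items():
        # Only this sample's slides are resident at once; they are freed
        # (dropped from scope) before the next sample is loaded.
        features_list, coords_list = zip(*(load_fn(slide_paths[i]) for i in indices))
        result = aggregate_fn(list(features_list), list(coords_list), [sid] * len(indices))
        write_fn(sid, *result[sid])
```

Here `load_fn` would wrap the `h5py.File` read and `write_fn` the per-sample H5 write, keeping the refactored utility pure while restoring the old memory profile.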
Summary
Refactors `aggregate_sample_features` in `mussel/utils/feature_extract.py` to follow the same pattern as other utility functions (e.g. `get_features`, `subsample_tiles`): accept in-memory objects, not file paths.

Changes

- `mussel/utils/feature_extract.py`
  - Before: took `patch_features_h5_paths: List[str]`, `output_dir: str`, `output_h5_suffix: str`; read H5 files internally; wrote output H5 files; returned `None`
  - After: takes `features_list: List[np.ndarray]`, `coords_list: List[np.ndarray]`; no file I/O; returns `dict[str, tuple[np.ndarray, np.ndarray]]` mapping `sample_id → (features, coords)`
- `mussel/cli/aggregate_sample_features.py`: reads per-slide H5 inputs into arrays, calls the utility, and writes per-sample H5 outputs
- `tests/mussel/cli/test_aggregate_sample_features.py`: utility tests use in-memory `np.ndarray` objects directly (no temp H5 files needed)