
refactor: aggregate_sample_features accepts arrays instead of file paths#115

Merged
raylim merged 1 commit into main from refactor/aggregate-sample-features-use-objects
Apr 3, 2026

Conversation

@raylim (Contributor) commented Apr 3, 2026

Summary

Refactors aggregate_sample_features in mussel/utils/feature_extract.py to follow the same pattern as other utility functions (e.g. get_features, subsample_tiles): accept in-memory objects, not file paths.

Changes

mussel/utils/feature_extract.py

  • Before: accepted patch_features_h5_paths: List[str], output_dir: str, output_h5_suffix: str; read H5 files internally; wrote output H5 files; returned None
  • After: accepts features_list: List[np.ndarray], coords_list: List[np.ndarray]; no file I/O; returns dict[str, tuple[np.ndarray, np.ndarray]] mapping sample_id → (features, coords)

mussel/cli/aggregate_sample_features.py

  • CLI wrapper now owns all file I/O: reads H5 inputs before calling the utility, writes output H5 files after it returns
  • Same end-to-end behaviour for CLI users — no change to CLI interface

tests/mussel/cli/test_aggregate_sample_features.py

  • Utility function tests now use np.ndarray objects directly (no temp H5 files needed)
  • CLI tests unchanged (still exercise the full H5 read/write path through the CLI)

- utility function now takes features_list/coords_list (List[np.ndarray])
  instead of patch_features_h5_paths (List[str]) and returns an ordered
  dict mapping sample_id -> (features, coords) instead of writing files
- file I/O (reading H5 inputs, writing output H5s) moved to CLI wrapper,
  consistent with the pattern used by other utility functions
- tests for the utility function now use in-memory numpy arrays directly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
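Based on the description above, the new in-memory contract could be sketched roughly as follows. This is a hedged sketch, not the PR's actual code: the grouping loop and the seeded random pick standing in for `subsample_tiles` are assumptions.

```python
# Sketch of the refactored utility as described: group per-slide arrays by
# sample_id (preserving first-seen order), concatenate, optionally subsample,
# and return a dict instead of writing H5 files. Details are assumptions.
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

import numpy as np


def aggregate_sample_features(
    features_list: List[np.ndarray],
    coords_list: List[np.ndarray],
    sample_ids: List[str],
    max_tiles: Optional[int] = None,
    seed: int = 42,
) -> Dict[str, Tuple[np.ndarray, np.ndarray]]:
    # Group slide indices by sample_id, preserving first-seen order.
    groups: "OrderedDict[str, list]" = OrderedDict()
    for i, sid in enumerate(sample_ids):
        groups.setdefault(sid, []).append(i)

    results: "OrderedDict[str, Tuple[np.ndarray, np.ndarray]]" = OrderedDict()
    for sid, indices in groups.items():
        feats = np.concatenate([features_list[i] for i in indices])
        coords = np.concatenate([coords_list[i] for i in indices])
        if max_tiles is not None and len(feats) > max_tiles:
            # Stand-in for the real subsample_tiles helper: seeded random pick.
            rng = np.random.default_rng(seed)
            keep = rng.choice(len(feats), size=max_tiles, replace=False)
            feats, coords = feats[keep], coords[keep]
        results[sid] = (feats, coords)
    return results
```

With this shape, the CLI wrapper stays the only place that touches h5py, matching the pattern the PR cites for `get_features` and `subsample_tiles`.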
Copilot AI review requested due to automatic review settings April 3, 2026 16:56
Copilot AI left a comment

Pull request overview

Refactors aggregate_sample_features to be a pure in-memory utility (accepting per-slide feature/coord arrays and returning per-sample aggregated arrays), moving all HDF5 read/write responsibilities into the CLI wrapper to match other utility patterns in the codebase.

Changes:

  • Updated aggregate_sample_features to accept features_list/coords_list and return per-sample aggregated results (no file I/O).
  • Updated CLI command to read per-slide H5 inputs into arrays, call the utility, and write per-sample H5 outputs.
  • Updated unit tests to exercise the utility via in-memory arrays while keeping end-to-end CLI H5 tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
mussel/utils/feature_extract.py Refactors aggregation to operate on arrays and return a results mapping instead of writing H5 files.
mussel/cli/aggregate_sample_features.py Shifts H5 read/write into the CLI wrapper around the refactored utility.
tests/mussel/cli/test_aggregate_sample_features.py Updates utility tests to avoid temp H5 files; retains CLI end-to-end tests.


Comment on lines 1957 to +1963

        for i in indices:
    -       h5_path = patch_features_h5_paths[i]
    -       with h5py.File(h5_path, "r") as h5:
    -           feats = np.array(h5["features"])
    -           coords = h5["coords"][:]
    +       feats = features_list[i]
    +       coords = coords_list[i]
            all_features.append(feats)
            all_coords.append(coords)
            slide_sizes.append(len(feats))
    -       logger.debug("  slide %d: %d tiles from %s", i, len(feats), h5_path)
    +       logger.debug("  slide %d: %d tiles", i, len(feats))

Copilot AI Apr 3, 2026


aggregate_sample_features now assumes each features_list[i] and coords_list[i] pair is consistent, but it never validates shapes/lengths. If a slide has len(coords) != len(feats) (or unexpected dimensionality), np.concatenate / subsample_tiles can later fail with a less-informative error or silently produce inconsistent outputs. Consider validating per-slide that feats.ndim == 2, coords.ndim == 2, coords.shape[1] == 2, and len(coords) == len(feats) and raising a ValueError that includes the offending slide index/sample_id.
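The per-slide validation this comment suggests could look roughly like the following. It is a sketch: the helper name and error wording are invented, and only the checks the reviewer lists are included.

```python
# Hypothetical pre-flight validation for the in-memory API: each slide's
# features must be 2-D, coords must be (n, 2), and the two must align.
from typing import List

import numpy as np


def validate_slide_arrays(
    features_list: List[np.ndarray],
    coords_list: List[np.ndarray],
    sample_ids: List[str],
) -> None:
    for i, (feats, coords) in enumerate(zip(features_list, coords_list)):
        sid = sample_ids[i]
        if feats.ndim != 2:
            raise ValueError(f"slide {i} ({sid}): features must be 2-D, got {feats.ndim}-D")
        if coords.ndim != 2 or coords.shape[1] != 2:
            raise ValueError(f"slide {i} ({sid}): coords must have shape (n, 2), got {coords.shape}")
        if len(coords) != len(feats):
            raise ValueError(
                f"slide {i} ({sid}): {len(feats)} feature rows vs {len(coords)} coord rows"
            )
```

Raising before concatenation keeps the error tied to the offending slide index and sample_id, rather than surfacing later as a shape mismatch inside np.concatenate.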

Comment on lines 1906 to +1914

        def aggregate_sample_features(
    -       patch_features_h5_paths: List[str],
    +       features_list: List[np.ndarray],
    +       coords_list: List[np.ndarray],
            sample_ids: List[str],
    -       output_dir: str,
    -       output_h5_suffix: str = "features.h5",
            max_tiles: Optional[int] = None,
            subsampling_strategy: str = "random",
            seed: int = 42,
    -   ) -> None:
    -       """Concatenate per-slide patch features into one H5 per sample.
    +   ) -> dict:
    +       """Concatenate per-slide patch features into one array per sample.

Copilot AI Apr 3, 2026


The return type annotation ) -> dict: (and groups: dict / results: dict) is very generic and doesn’t match the documented API (“Ordered dict mapping each unique sample_id to a (features, coords) tuple”). Please tighten the typing to something like collections.OrderedDict[str, tuple[np.ndarray, np.ndarray]] (or dict[str, tuple[np.ndarray, np.ndarray]] if you don’t want to expose OrderedDict) to make the new in-memory API easier/safer to consume.
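The tighter annotation the reviewer asks for could look like this. The `SampleResults` alias is invented for illustration, and subscripting `collections.OrderedDict` at runtime requires Python 3.9+ (PEP 585); on older versions the subscripted form belongs in a string annotation or `typing.OrderedDict`.

```python
# Sketch of the suggested typed return annotation (Python 3.9+).
from collections import OrderedDict

import numpy as np

# Hypothetical alias; not a name from the PR.
SampleResults = OrderedDict[str, tuple[np.ndarray, np.ndarray]]


def make_sample_results() -> SampleResults:
    # Stub body showing the annotation in use; the real logic lives in the PR.
    results: SampleResults = OrderedDict()
    results["sampleA"] = (np.zeros((3, 4)), np.zeros((3, 2)))
    return results
```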

Comment on lines +152 to +160

    +   results = _aggregate_sample_features(
    +       features_list=[feats_a, feats_b],
    +       coords_list=[coords_a, coords_b],
            sample_ids=["sampleX", "sampleX"],
    -       output_dir=str(tmp_path / "out"),
            max_tiles=None,
        )

    -   out_h5 = tmp_path / "out" / "sampleX.features.h5"
    -   assert out_h5.exists()
    -   with h5py.File(out_h5) as f:
    -       assert f["features"].shape == (35, 4)
    -       assert f["coords"].shape == (35, 2)
    +   assert results["sampleX"][0].shape == (35, 4)
    +   assert results["sampleX"][1].shape == (35, 2)

Copilot AI Apr 3, 2026


test_aggregate_sample_features_multi_slide only asserts output shapes. Since this PR changes aggregate_sample_features behavior significantly, it would be more robust to also assert that the returned arrays equal the expected concatenation (e.g., np.concatenate([feats_a, feats_b]) / np.concatenate([coords_a, coords_b])) so ordering/content regressions are caught.
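A content assertion along the suggested lines might be factored like this. It is a sketch: the helper name is invented, and it assumes the returned arrays preserve slide input order (plain `np.concatenate` semantics, per the PR description).

```python
# Hypothetical test helper: assert a sample's returned arrays equal the
# expected slide-order concatenation, not just the expected shapes.
import numpy as np


def assert_sample_is_concatenation(results, sample_id, feats_parts, coords_parts):
    """Catch ordering/content regressions, not only shape regressions."""
    np.testing.assert_array_equal(results[sample_id][0], np.concatenate(feats_parts))
    np.testing.assert_array_equal(results[sample_id][1], np.concatenate(coords_parts))
```

In the test above this would replace the bare shape asserts, so a bug that reorders slides or swaps features/coords still fails loudly.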

Comment on lines 181 to +198

        def test_aggregate_sample_features_with_subsampling(tmp_path):
            """Subsampling reduces output to max_tiles."""
    -       h5_a = tmp_path / "s0.h5"
    -       h5_b = tmp_path / "s1.h5"
    -       _write_fake_h5(h5_a, n_tiles=80, seed=0)
    -       _write_fake_h5(h5_b, n_tiles=60, seed=1)
    -
    -       _aggregate_sample_features(
    -           patch_features_h5_paths=[str(h5_a), str(h5_b)],
    +       rng = np.random.default_rng(0)
    +       feats_a = rng.random((80, 4)).astype(np.float32)
    +       coords_a = rng.integers(0, 1000, (80, 2))
    +       feats_b = rng.random((60, 4)).astype(np.float32)
    +       coords_b = rng.integers(0, 1000, (60, 2))
    +
    +       results = _aggregate_sample_features(
    +           features_list=[feats_a, feats_b],
    +           coords_list=[coords_a, coords_b],
                sample_ids=["big", "big"],
    -           output_dir=str(tmp_path / "out"),
                max_tiles=50,
                subsampling_strategy="random",
                seed=99,
            )
    -
    -       out_h5 = tmp_path / "out" / "big.features.h5"
    -       with h5py.File(out_h5) as f:
    -           assert f["features"].shape[0] == 50
    +       assert results["big"][0].shape[0] == 50

Copilot AI Apr 3, 2026


test_aggregate_sample_features_with_subsampling currently only checks that the output tile count is reduced to max_tiles. Consider also validating that subsampling is reproducible given seed and that features and coords remain aligned (e.g., by using easily-identifiable rows/coords and asserting they correspond) to catch potential indexing bugs in the aggregation path.
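The reproducibility-and-alignment check the reviewer describes can be sketched in isolation. This assumes seeded random subsampling with a NumPy generator (an assumption about the utility's internals); the trick is to encode the row index into both arrays so misalignment is detectable.

```python
# Sketch: seeded subsampling should be reproducible, and the same row
# indices must be applied to features and coords. `subsample` is a
# stand-in for the aggregation path's subsampling step.
import numpy as np


def subsample(feats, coords, max_tiles, seed):
    keep = np.random.default_rng(seed).choice(len(feats), size=max_tiles, replace=False)
    return feats[keep], coords[keep]


# Identifiable rows: row i carries the value i in both arrays.
feats = np.arange(80, dtype=np.float32).reshape(80, 1)
coords = np.stack([np.arange(80), np.arange(80)], axis=1)
```

Running `subsample` twice with the same seed should select identical rows, and comparing `feats[:, 0]` to `coords[:, 0]` after subsampling verifies the two arrays were indexed together.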

Comment on lines +113 to +123

    +   features_list = []
    +   coords_list = []
    +   for h5_path in patch_features_h5_paths:
    +       with h5py.File(h5_path, "r") as h5:
    +           features_list.append(np.array(h5["features"]))
    +           coords_list.append(h5["coords"][:])
    +
    +   results = aggregate_sample_features(
    +       features_list=features_list,
    +       coords_list=coords_list,
    +       sample_ids=sample_ids,

Copilot AI Apr 3, 2026


The CLI now loads all slide feature arrays into memory upfront (features_list / coords_list) before grouping/aggregating. Previously, aggregation read H5s within the per-sample loop, so peak memory scaled with the largest sample rather than the entire input set. For large cohorts this can be a substantial memory regression and risk OOM; consider grouping indices by sample_id first, then reading/aggregating one sample at a time (or otherwise streaming) so only one sample’s slides are resident at once.
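The per-sample streaming the reviewer suggests could be sketched as below. The generator name is invented, and H5 reading is stubbed behind a `load_h5` callable so the sketch stays self-contained; in the real CLI that callable would be an h5py read.

```python
# Hypothetical streaming shape for the CLI: group paths by sample_id first,
# then load and yield one sample's arrays at a time, so peak memory scales
# with the largest sample rather than the whole cohort.
from collections import OrderedDict
from typing import Callable, Iterator, List, Tuple


def iter_sample_arrays(
    patch_features_h5_paths: List[str],
    sample_ids: List[str],
    load_h5: Callable[[str], object],
) -> Iterator[Tuple[str, list]]:
    groups: "OrderedDict[str, List[str]]" = OrderedDict()
    for path, sid in zip(patch_features_h5_paths, sample_ids):
        groups.setdefault(sid, []).append(path)
    for sid, paths in groups.items():
        # Only this sample's slides are resident at once; previous samples'
        # arrays can be garbage-collected after their output is written.
        yield sid, [load_h5(p) for p in paths]
```

The CLI loop would then aggregate and write each yielded sample before loading the next, preserving the old per-sample memory profile while keeping the utility free of file I/O.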

@raylim raylim merged commit 844ea67 into main Apr 3, 2026
6 checks passed