
Fix datasets import failing by developing standalone dataset system#111

Merged
marcovarrone merged 8 commits into main from 110-dataset-import on Feb 12, 2026
Conversation

@marcovarrone (Collaborator) commented Feb 12, 2026

Closed issues: #110 (Datasets import fails)

Summary by CodeRabbit

  • New Features

    • Progress visualization for dataset downloads.
    • More reliable dataset fetching with automatic download, integrity checks, caching, and direct loading of dataset files into the app.
  • Chores

    • Bumped package version to 0.3.6.
    • Raised minimum Python requirement to 3.11.
    • Updated torchgmm to >=0.1.4 and re-added spatialdata and spatialdata-plot dependencies.
  • Tests

    • More robust test fixtures and download handling for dataset-based tests.

@coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds a structured dataset download/resolution pipeline (DatasetMetadata, HDF5 validation, FigShare URL normalization, safe downloading, cache resolution) for codex_mouse_spleen; updates packaging (version, Python requirement, torchgmm, re-added spatialdata deps) and the CI Python matrix; tests are updated to use local HDF5 fixture handling.

Changes

Configuration & Dependencies — pyproject.toml, .github/workflows/test.yaml
  Bump package version to 0.3.6; change requires-python to >=3.11,<3.14; raise the torchgmm requirement to >=0.1.4; re-add spatialdata and spatialdata-plot; update the GitHub Actions matrix to Python 3.11–3.13.

Dataset System Refactor — src/cellcharter/datasets/_dataset.py
  Introduce a DatasetMetadata dataclass and a download/resolution pipeline: cache-directory resolution, FigShare URL normalization, HDF5 header validation, safe temp-file downloads with optional tqdm progress, path resolution, and orchestration. codex_mouse_spleen now fetches via this pipeline and returns an anndata.AnnData.

Tests & Fixtures — tests/conftest.py
  Add HDF5 validation and download helpers for the test fixture (_is_hdf5_file, _download_codex); the codex_adata fixture now ensures the local .h5ad file is valid, downloads it if needed, and handles HTTPError/URLError/OSError retries and skip logic.

Sequence Diagram

sequenceDiagram
    participant User as "User"
    participant CMS as "codex_mouse_spleen"
    participant Fetch as "_fetch_dataset_file"
    participant Resolve as "_resolve_dataset_path"
    participant Download as "_download_file"
    participant Validate as "_is_hdf5_file"
    participant Read as "anndata.read_h5ad"
    participant FS as "FileSystem"

    User->>CMS: call(path, force_download)
    CMS->>Fetch: orchestrate download/resolution
    Fetch->>Resolve: determine full path (metadata or direct)
    Resolve->>FS: check cache / path existence
    FS-->>Resolve: path or missing
    alt cached and valid
        Resolve-->>Fetch: return cached path
    else download needed
        Fetch->>Download: request download (FigShare normalization)
        Download->>FS: write temp file (show progress)
        Download->>Validate: check HDF5 header
        Validate-->>Download: valid / invalid
        Download->>FS: move temp -> final path
        Download-->>Fetch: return downloaded path
    end
    Fetch-->>CMS: dataset file path
    CMS->>Read: anndata.read_h5ad(path)
    Read-->>CMS: AnnData
    CMS-->>User: AnnData object

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped to FigShare, nose to the sky,
I fetched little files and watched bytes fly,
I checked every header, kept caches neat,
I handed you AnnData, tidy and sweet,
A tiny rabbit's hop — dataset complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 10.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check — Skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — The title clearly and specifically summarizes the main change: implementing a standalone dataset system to fix dataset import failures.
  • Linked Issues Check — The PR addresses issue #110 by removing the squidpy dependency and implementing a standalone dataset download/validation system across multiple files.
  • Out of Scope Changes Check — All changes are scoped to implementing the standalone dataset system: version bumps, dependency updates, new dataset infrastructure, and test fixture updates.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
tests/conftest.py (2)

25-42: Partial .part file not cleaned up if the download itself raises.

If urlopen succeeds but response.read() fails mid-stream (timeout, connection reset), the .part file is left on disk. While the next invocation would overwrite it, wrapping the download in try/finally keeps the filesystem tidy.

Also, per Ruff S310, consider validating the URL scheme before opening, as a defensive measure.

♻️ Proposed fix
 def _download_codex(path: Path) -> None:
     path.parent.mkdir(parents=True, exist_ok=True)
     tmp_path = path.with_suffix(f"{path.suffix}.part")
+    if not _CODEX_URL.startswith(("https://", "http://")):
+        raise ValueError(f"Unexpected URL scheme: {_CODEX_URL}")
     request = Request(_CODEX_URL, headers={"User-Agent": "cellcharter-tests"})
 
-    with urlopen(request, timeout=60) as response, tmp_path.open("wb") as output:
-        chunk_size = 1024 * 1024
-        while True:
-            chunk = response.read(chunk_size)
-            if not chunk:
-                break
-            output.write(chunk)
+    try:
+        with urlopen(request, timeout=60) as response, tmp_path.open("wb") as output:  # noqa: S310
+            chunk_size = 1024 * 1024
+            while True:
+                chunk = response.read(chunk_size)
+                if not chunk:
+                    break
+                output.write(chunk)
+    except Exception:
+        tmp_path.unlink(missing_ok=True)
+        raise
 
     if not _is_hdf5_file(tmp_path):
         tmp_path.unlink(missing_ok=True)

58-75: Good retry logic with session scope.

One edge case: a corrupt-but-valid-header HDF5 file could pass _is_hdf5_file but cause ad.read_h5ad to raise an exception outside (HTTPError, URLError, OSError) (e.g., KeyError from h5py). This would bypass the retry/cleanup logic and fail the test outright. Consider broadening the except clause to Exception (or at least adding KeyError) with the same cleanup/retry behavior, or alternatively validating with ad.read_h5ad inside the HDF5 check.

♻️ Suggested broadening of the except clause
-        except (HTTPError, URLError, OSError) as e:
+        except Exception as e:
             _CODEX_PATH.unlink(missing_ok=True)  # Force re-download on next attempt


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `pyproject.toml`:
- Around lines 13-16: the inline comment on the requires-python metadata is stale and misspells "multiscale". Update or remove the comment after the `requires-python = ">=3.11,<3.14"` line so it accurately reflects that 3.13 is already included, and fix the typo "multicale" to "multiscale" if keeping a note (e.g., change it to '# include 3.13 when spatialdata/multiscale-spatial-image constraints require it'), or simply remove the comment entirely.
🧹 Nitpick comments (4)
pyproject.toml (1)

96-96: target-version for Black is stale.

target-version is set to ['py38'] but the minimum Python version is now 3.11. This won't break anything but means Black won't apply newer syntax transformations it could.

Proposed fix
-target-version = ['py38']
+target-version = ['py311']
src/cellcharter/datasets/_dataset.py (3)

56-93: Several small improvements for _download_file.

  1. URL scheme audit (S310): urlopen will happily follow file:// or other schemes. Since the URL source is currently hardcoded metadata this is low risk, but adding a scheme check is cheap insurance, especially if this is ever reused with user-supplied URLs.

  2. Invalid noqa directive on line 77: B902 is not a valid flake8-bugbear / ruff rule code. The directive should be removed.

  3. raise error on line 82: Prefer bare raise to preserve the full traceback chain without repeating the exception name.

Proposed fixes
 def _download_file(url: str, destination: Path) -> None:
+    normalized_url = _normalize_figshare_url(url)
+    if urlparse(normalized_url).scheme not in ("http", "https"):
+        raise ValueError(f"Only http/https URLs are supported, got: {normalized_url}")
     destination.parent.mkdir(parents=True, exist_ok=True)
     tmp_destination = destination.with_suffix(f"{destination.suffix}.part")
 
-    request = Request(_normalize_figshare_url(url), headers={"User-Agent": "cellcharter-datasets"})
+    request = Request(normalized_url, headers={"User-Agent": "cellcharter-datasets"})
     progress = None
     try:
         with urlopen(request, timeout=60) as response, tmp_destination.open("wb") as output:
             ...
-    except Exception as error:  # noqa: B902
+    except Exception as error:
             ...
         if isinstance(error, (HTTPError, URLError)):
             raise RuntimeError(f"Failed to download dataset from {url}.") from error
-        raise error
+        raise
     finally:

48-53: _normalize_figshare_url — consider adding a brief docstring.

The logic is correct: it converts figshare.com/ndownloader/files/… to the ndownloader.figshare.com/files/… canonical download endpoint. However, the purpose isn't immediately obvious from the code alone. A one-line docstring would help future maintainers.

Also note the currently hardcoded _codex_mouse_spleen.url already uses the ndownloader.figshare.com form, so this function only activates if someone later supplies the alternate figshare.com/ndownloader/… variant.


96-100: Suffix-based heuristic in _resolve_dataset_path can misidentify directories.

base_dir.suffix returns a non-empty string for any path with a dot in its final component (e.g., a directory named my.datasets). This would cause the function to treat it as a file path and skip appending the filename.

Since this is an internal helper with a well-documented path parameter, the risk is low, but worth a brief comment or a slightly more robust check (e.g., checking if the path already exists as a directory).

Slightly more robust alternative
 def _resolve_dataset_path(metadata: DatasetMetadata, path: str | Path | None = None) -> Path:
     base_dir = _default_datasets_dir() if path is None else Path(path)
-    if base_dir.suffix:
+    if base_dir.suffix and not base_dir.is_dir():
         return base_dir
     return base_dir / metadata.filename

- Introduced _is_hdf5_file and _download_codex functions to handle HDF5 file validation and downloading.
- Updated codex_adata fixture to check for the existence of the codex_adata file and download it if necessary.
- Enhanced error handling to include URLError and OSError during data retrieval.
@marcovarrone marcovarrone merged commit 6dea763 into main Feb 12, 2026
4 of 6 checks passed


Development

Successfully merging this pull request may close these issues.

Datasets import fails
