
Fix datasets import failing by developing standalone dataset system#111

Merged
marcovarrone merged 8 commits into main from 110-dataset-import on Feb 12, 2026
Conversation

@marcovarrone (Collaborator) commented Feb 12, 2026

Closed issues: #110 (Datasets import fails)

Summary by CodeRabbit

  • New Features

    • Progress visualization for dataset downloads.
    • More reliable dataset fetching with automatic download, integrity checks, caching, and direct loading of dataset files into the app.
  • Chores

    • Bumped package version to 0.3.6.
    • Raised minimum Python requirement to 3.11.
    • Updated torchgmm to >=0.1.4 and re-added spatialdata and spatialdata-plot dependencies.
  • Tests

    • More robust test fixtures and download handling for dataset-based tests.

@coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds a structured dataset download/resolution pipeline (DatasetMetadata, HDF5 validation, FigShare URL normalization, safe downloading, cache resolution) for codex_mouse_spleen; updates packaging (version, Python requirement, torchgmm, re-added spatialdata deps) and the CI Python matrix; tests are updated to use local HDF5 fixture handling.

Changes

Configuration & Dependencies — pyproject.toml, .github/workflows/test.yaml
  Bump package version to 0.3.6; change requires-python to >=3.11,<3.14; raise the torchgmm requirement to >=0.1.4; re-add spatialdata and spatialdata-plot; update the GitHub Actions matrix to Python 3.11–3.13.

Dataset System Refactor — src/cellcharter/datasets/_dataset.py
  Introduce a DatasetMetadata dataclass and a download/resolution pipeline: cache-directory resolution, FigShare URL normalization, HDF5 header validation, safe temp-file downloads with optional tqdm progress, path resolution, and orchestration. codex_mouse_spleen now fetches via this pipeline and returns an anndata.AnnData.

Tests & Fixtures — tests/conftest.py
  Add HDF5 validation and download helpers for the test fixture (_is_hdf5_file, _download_codex); the codex_adata fixture now ensures the local .h5ad file is valid, downloads it if needed, and handles HTTPError/URLError/OSError retries and skip logic.

Sequence Diagram

sequenceDiagram
    participant User as "User"
    participant CMS as "codex_mouse_spleen"
    participant Fetch as "_fetch_dataset_file"
    participant Resolve as "_resolve_dataset_path"
    participant Download as "_download_file"
    participant Validate as "_is_hdf5_file"
    participant Read as "anndata.read_h5ad"
    participant FS as "FileSystem"

    User->>CMS: call(path, force_download)
    CMS->>Fetch: orchestrate download/resolution
    Fetch->>Resolve: determine full path (metadata or direct)
    Resolve->>FS: check cache / path existence
    FS-->>Resolve: path or missing
    alt cached and valid
        Resolve-->>Fetch: return cached path
    else download needed
        Fetch->>Download: request download (FigShare normalization)
        Download->>FS: write temp file (show progress)
        Download->>Validate: check HDF5 header
        Validate-->>Download: valid / invalid
        Download->>FS: move temp -> final path
        Download-->>Fetch: return downloaded path
    end
    Fetch-->>CMS: dataset file path
    CMS->>Read: anndata.read_h5ad(path)
    Read-->>CMS: AnnData
    CMS-->>User: AnnData object

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped to FigShare, nose to the sky,
I fetched little files and watched bytes fly,
I checked every header, kept caches neat,
I handed you AnnData, tidy and sweet,
A tiny rabbit's hop — dataset complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 10.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check — Skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — The title clearly and specifically summarizes the main change: implementing a standalone dataset system to fix dataset import failures.
  • Linked Issues Check — The PR addresses issue #110 by removing the squidpy dependency and implementing a standalone dataset download/validation system across multiple files.
  • Out of Scope Changes Check — All changes are scoped to implementing the standalone dataset system: version bumps, dependency updates, new dataset infrastructure, and test fixture updates.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
tests/conftest.py (2)

25-42: Partial .part file not cleaned up if the download itself raises.

If urlopen succeeds but response.read() fails mid-stream (timeout, connection reset), the .part file is left on disk. While the next invocation would overwrite it, wrapping the download in try/finally keeps the filesystem tidy.

Also, per Ruff S310, consider validating the URL scheme before opening, as a defensive measure.

♻️ Proposed fix
 def _download_codex(path: Path) -> None:
     path.parent.mkdir(parents=True, exist_ok=True)
     tmp_path = path.with_suffix(f"{path.suffix}.part")
+    if not _CODEX_URL.startswith(("https://", "http://")):
+        raise ValueError(f"Unexpected URL scheme: {_CODEX_URL}")
     request = Request(_CODEX_URL, headers={"User-Agent": "cellcharter-tests"})
 
-    with urlopen(request, timeout=60) as response, tmp_path.open("wb") as output:
-        chunk_size = 1024 * 1024
-        while True:
-            chunk = response.read(chunk_size)
-            if not chunk:
-                break
-            output.write(chunk)
+    try:
+        with urlopen(request, timeout=60) as response, tmp_path.open("wb") as output:  # noqa: S310
+            chunk_size = 1024 * 1024
+            while True:
+                chunk = response.read(chunk_size)
+                if not chunk:
+                    break
+                output.write(chunk)
+    except Exception:
+        tmp_path.unlink(missing_ok=True)
+        raise
 
     if not _is_hdf5_file(tmp_path):
         tmp_path.unlink(missing_ok=True)

58-75: Good retry logic with session scope.

One edge case: a corrupt-but-valid-header HDF5 file could pass _is_hdf5_file but cause ad.read_h5ad to raise an exception outside (HTTPError, URLError, OSError) (e.g., KeyError from h5py). This would bypass the retry/cleanup logic and fail the test outright. Consider broadening the except clause to Exception (or at least adding KeyError) with the same cleanup/retry behavior, or alternatively validating with ad.read_h5ad inside the HDF5 check.

♻️ Suggested broadening of the except clause
-        except (HTTPError, URLError, OSError) as e:
+        except Exception as e:
             _CODEX_PATH.unlink(missing_ok=True)  # Force re-download on next attempt


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `pyproject.toml`:
- Around lines 13-16: the inline comment on the requires-python metadata is stale and misspells "multiscale". Update or remove the comment after the `requires-python = ">=3.11,<3.14"` line so it accurately reflects that 3.13 is already included, and fix the typo "multicale" to "multiscale" if keeping a note (e.g., change it to '# include 3.13 when spatialdata/multiscale-spatial-image constraints require it'), or simply remove the comment entirely.
🧹 Nitpick comments (4)
pyproject.toml (1)

96-96: target-version for Black is stale.

target-version is set to ['py38'] but the minimum Python version is now 3.11. This won't break anything but means Black won't apply newer syntax transformations it could.

Proposed fix
-target-version = ['py38']
+target-version = ['py311']
src/cellcharter/datasets/_dataset.py (3)

56-93: Several small improvements for _download_file.

  1. URL scheme audit (S310): urlopen will happily follow file:// or other schemes. Since the URL source is currently hardcoded metadata this is low risk, but adding a scheme check is cheap insurance, especially if this is ever reused with user-supplied URLs.

  2. Invalid noqa directive on line 77: B902 is not a valid flake8-bugbear / ruff rule code. The directive should be removed.

  3. raise error on line 82: Prefer bare raise to preserve the full traceback chain without repeating the exception name.

Proposed fixes
 def _download_file(url: str, destination: Path) -> None:
+    normalized_url = _normalize_figshare_url(url)
+    if urlparse(normalized_url).scheme not in ("http", "https"):
+        raise ValueError(f"Only http/https URLs are supported, got: {normalized_url}")
     destination.parent.mkdir(parents=True, exist_ok=True)
     tmp_destination = destination.with_suffix(f"{destination.suffix}.part")
 
-    request = Request(_normalize_figshare_url(url), headers={"User-Agent": "cellcharter-datasets"})
+    request = Request(normalized_url, headers={"User-Agent": "cellcharter-datasets"})
     progress = None
     try:
         with urlopen(request, timeout=60) as response, tmp_destination.open("wb") as output:
             ...
-    except Exception as error:  # noqa: B902
+    except Exception as error:
             ...
         if isinstance(error, (HTTPError, URLError)):
             raise RuntimeError(f"Failed to download dataset from {url}.") from error
-        raise error
+        raise
     finally:

48-53: _normalize_figshare_url — consider adding a brief docstring.

The logic is correct: it converts figshare.com/ndownloader/files/… to the ndownloader.figshare.com/files/… canonical download endpoint. However, the purpose isn't immediately obvious from the code alone. A one-line docstring would help future maintainers.

Also note the currently hardcoded _codex_mouse_spleen.url already uses the ndownloader.figshare.com form, so this function only activates if someone later supplies the alternate figshare.com/ndownloader/… variant.


96-100: Suffix-based heuristic in _resolve_dataset_path can misidentify directories.

base_dir.suffix returns a non-empty string for any path with a dot in its final component (e.g., a directory named my.datasets). This would cause the function to treat it as a file path and skip appending the filename.

Since this is an internal helper with a well-documented path parameter, the risk is low, but worth a brief comment or a slightly more robust check (e.g., checking if the path already exists as a directory).

Slightly more robust alternative
 def _resolve_dataset_path(metadata: DatasetMetadata, path: str | Path | None = None) -> Path:
     base_dir = _default_datasets_dir() if path is None else Path(path)
-    if base_dir.suffix:
+    if base_dir.suffix and not base_dir.is_dir():
         return base_dir
     return base_dir / metadata.filename

- Introduced _is_hdf5_file and _download_codex functions to handle HDF5 file validation and downloading.
- Updated codex_adata fixture to check for the existence of the codex_adata file and download it if necessary.
- Enhanced error handling to include URLError and OSError during data retrieval.
@marcovarrone marcovarrone merged commit 6dea763 into main Feb 12, 2026
4 of 6 checks passed


Development

Successfully merging this pull request may close these issues.

Datasets import fails
