Fix datasets import failing by developing standalone dataset system #111
marcovarrone merged 8 commits into main
Conversation
📝 Walkthrough

Adds a structured dataset download/resolution pipeline (DatasetMetadata, HDF5 validation, FigShare URL normalization, safe downloading, cache resolution) for codex_mouse_spleen; updates packaging (version, Python requirement, torchgmm, re-adds spatialdata deps) and the CI Python matrix; tests are updated to use local HDF5 fixture handling.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as "User"
    participant CMS as "codex_mouse_spleen"
    participant Fetch as "_fetch_dataset_file"
    participant Resolve as "_resolve_dataset_path"
    participant Download as "_download_file"
    participant Validate as "_is_hdf5_file"
    participant Read as "anndata.read_h5ad"
    participant FS as "FileSystem"
    User->>CMS: call(path, force_download)
    CMS->>Fetch: orchestrate download/resolution
    Fetch->>Resolve: determine full path (metadata or direct)
    Resolve->>FS: check cache / path existence
    FS-->>Resolve: path or missing
    alt cached and valid
        Resolve-->>Fetch: return cached path
    else download needed
        Fetch->>Download: request download (FigShare normalization)
        Download->>FS: write temp file (show progress)
        Download->>Validate: check HDF5 header
        Validate-->>Download: valid / invalid
        Download->>FS: move temp -> final path
        Download-->>Fetch: return downloaded path
    end
    Fetch-->>CMS: dataset file path
    CMS->>Read: anndata.read_h5ad(path)
    Read-->>CMS: AnnData
    CMS-->>User: AnnData object
```
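The actual `DatasetMetadata` class is not shown in this review. As a rough, hypothetical sketch of what such a container might look like (field names `name`, `filename`, and `url` are assumptions inferred from the diagram and the `_resolve_dataset_path` diff, not the real class definition):

```python
# Hypothetical sketch of a DatasetMetadata container; field names are
# assumptions based on this review, not the actual cellcharter code.
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    name: str      # dataset identifier, e.g. "codex_mouse_spleen"
    filename: str  # file name used inside the local cache directory
    url: str       # download URL (FigShare ndownloader endpoint)


# Placeholder URL: <file-id> stands in for the real FigShare file id.
meta = DatasetMetadata(
    name="codex_mouse_spleen",
    filename="codex_mouse_spleen.h5ad",
    url="https://ndownloader.figshare.com/files/<file-id>",
)
```

A frozen dataclass keeps the metadata immutable, so cached-path resolution and downloading can treat it as a value object.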
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 4 passed
No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 13-16: The inline comment on the requires-python metadata is stale
and misspells "multiscale": update or remove the comment after the
requires-python = ">=3.11,<3.14" line so it accurately reflects that 3.13 is
already included and fix the typo "multicale" to "multiscale" if keeping a note
(e.g., change to '# include 3.13 when spatialdata/multiscale-spatial-image
constraints require it' or simply remove the comment entirely); edit the comment
near the requires-python entry in pyproject.toml accordingly.
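Applying that suggestion, the metadata line might read something like the following (a sketch of one way to phrase the corrected comment, not the exact text in the repository):

```toml
[project]
requires-python = ">=3.11,<3.14"  # 3.13 included; upper bound tracks spatialdata/multiscale-spatial-image support
```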
🧹 Nitpick comments (4)
pyproject.toml (1)

96-96: `target-version` for Black is stale.

`target-version` is set to `['py38']` but the minimum Python version is now 3.11. This won't break anything, but it means Black won't apply newer syntax transformations it could.

Proposed fix:

```diff
-target-version = ['py38']
+target-version = ['py311']
```

src/cellcharter/datasets/_dataset.py (3)
56-93: Several small improvements for `_download_file`.

URL scheme audit (S310): `urlopen` will happily follow `file://` or other schemes. Since the URL source is currently hardcoded metadata this is low risk, but adding a scheme check is cheap insurance, especially if this is ever reused with user-supplied URLs.

Invalid `noqa` directive on line 77: `B902` is not a valid flake8-bugbear / ruff rule code. The directive should be removed.

`raise error` on line 82: Prefer bare `raise` to preserve the full traceback chain without repeating the exception name.

Proposed fixes:

```diff
 def _download_file(url: str, destination: Path) -> None:
+    normalized_url = _normalize_figshare_url(url)
+    if urlparse(normalized_url).scheme not in ("http", "https"):
+        raise ValueError(f"Only http/https URLs are supported, got: {normalized_url}")
     destination.parent.mkdir(parents=True, exist_ok=True)
     tmp_destination = destination.with_suffix(f"{destination.suffix}.part")
-    request = Request(_normalize_figshare_url(url), headers={"User-Agent": "cellcharter-datasets"})
+    request = Request(normalized_url, headers={"User-Agent": "cellcharter-datasets"})
     progress = None
     try:
         with urlopen(request, timeout=60) as response, tmp_destination.open("wb") as output:
             ...
-    except Exception as error:  # noqa: B902
+    except Exception as error:
         ...
         if isinstance(error, (HTTPError, URLError)):
             raise RuntimeError(f"Failed to download dataset from {url}.") from error
-        raise error
+        raise
     finally:
```
48-53: `_normalize_figshare_url` — consider adding a brief docstring.

The logic is correct: it converts `figshare.com/ndownloader/files/…` to the `ndownloader.figshare.com/files/…` canonical download endpoint. However, the purpose isn't immediately obvious from the code alone. A one-line docstring would help future maintainers.

Also note that the currently hardcoded `_codex_mouse_spleen.url` already uses the `ndownloader.figshare.com` form, so this function only activates if someone later supplies the alternate `figshare.com/ndownloader/…` variant.
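A plausible shape of the normalization described above, with the suggested docstring (this is a sketch inferred from the review text, not the actual `_normalize_figshare_url` implementation):

```python
# Sketch of the FigShare URL normalization described in the review;
# an assumption about the implementation, not the actual cellcharter code.
from urllib.parse import urlparse


def normalize_figshare_url(url: str) -> str:
    """Map figshare.com/ndownloader/files/<id> to the canonical
    ndownloader.figshare.com/files/<id> download endpoint."""
    parsed = urlparse(url)
    if parsed.netloc == "figshare.com" and parsed.path.startswith("/ndownloader/"):
        # Strip the "/ndownloader" prefix and move it into the hostname.
        return f"https://ndownloader.figshare.com{parsed.path[len('/ndownloader'):]}"
    return url  # already canonical (or an unrelated host): leave untouched
```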
96-100: Suffix-based heuristic in `_resolve_dataset_path` can misidentify directories.

`base_dir.suffix` returns a non-empty string for any path with a dot in its final component (e.g., a directory named `my.datasets`). This would cause the function to treat it as a file path and skip appending the filename.

Since this is an internal helper with a well-documented `path` parameter, the risk is low, but it is worth a brief comment or a slightly more robust check (e.g., checking whether the path already exists as a directory).

Slightly more robust alternative:

```diff
 def _resolve_dataset_path(metadata: DatasetMetadata, path: str | Path | None = None) -> Path:
     base_dir = _default_datasets_dir() if path is None else Path(path)
-    if base_dir.suffix:
+    if base_dir.suffix and not base_dir.is_dir():
         return base_dir
     return base_dir / metadata.filename
```
- Introduced _is_hdf5_file and _download_codex functions to handle HDF5 file validation and downloading.
- Updated codex_adata fixture to check for the existence of the codex_adata file and download it if necessary.
- Enhanced error handling to include URLError and OSError during data retrieval.